[ 
https://issues.apache.org/jira/browse/SOLR-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-16992:
--------------------------------------
    Attachment: OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt
                thetaphi_solr_Solr-main-Linux_14679.log.txt
        Status: Open  (was: Open)

I'm attaching two files: a typical jenkins job failure of {{StreamingTest}} 
({_}before{_} SOLR-16983) and the non-reproducable failure i was able to 
locally coax out of branch_9x ({_}after{_} SOLR-16983)

Here's the important snippet from the output of the local failure...
{noformat}
   >     java.lang.AssertionError: ObjectTracker found 1 object(s) that were 
not released!!! [Input]
   >     
org.eclipse.jetty.client.util.InputStreamResponseListener$Input:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException:
 org.eclipse.jetty.client.u
til.InputStreamResponseListener$Input
   >            at 
org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:52)
   >            at 
org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener$ObjectReleaseTrackedInputStream.<init>(Http2SolrClient.java:
1495)
   >            at 
org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener.getInputStream(Http2SolrClient.java:1489)
   >            at 
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:532)
   >            at 
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1192)
   >            at 
org.apache.solr.client.solrj.io.stream.SolrStream.constructParser(SolrStream.java:304)
   >            at 
org.apache.solr.client.solrj.io.stream.SolrStream.open(SolrStream.java:129)
   >            at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:539)
   >            at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:527)
   >            at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   >            at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289)
   >            at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   >            at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   >            at java.base/java.lang.Thread.run(Thread.java:829)
   >     Caused by: java.lang.Exception: Submitter stack trace
   >            at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256)
   >            at 
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
   >            at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.lambda$openStreams$8(CloudSolrStream.java:419)
   >            at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
   >            at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
   >            at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
   >            at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
   >            at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
   >            at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   >            at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
   >            at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:420)
   >            at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:292)
   >            at 
org.apache.solr.client.solrj.io.stream.ExceptionStream.open(ExceptionStream.java:52)
   >            at 
org.apache.solr.client.solrj.io.stream.StreamingTest.getTuple(StreamingTest.java:2895)
   >            at 
org.apache.solr.client.solrj.io.stream.StreamingTest.testExceptionStream(StreamingTest.java:662)
{noformat}
Here is roughly what i think is happening...
 * {{testExceptionStream()}} calls {{getTuple(estream)}} calls 
{{ExceptionStream.open()}} calls {{CloudSolrStream.open()}}
 * {{CloudSolrStream.open()}} initializes {{List<TupleStream> solrStreams}} 
containing one {{SolrStream}} per shard and calls 
{{CloudSolrStream.openStreams()}}
 * {{CloudSolrStream.openStreams()}} ...
 ** spins up an {{ExecutorService servce = 
ExecutorUtil.newMDCAwareCachedThreadPool(new 
SolrNamedThreadFactory("CloudSolrStream"))}}
 ** calls {{service.submit(new StreamOpener(...))}} on every instance in 
{{solrStreams}} and records the {{List<Future<TupleWrapper>> futures}}
 ** loops over the {{futures}} ...
 *** If any throw an (Execution) exception: immediately wrap and rethrow that 
exception
 ** call {{servce.shutdown()}} in a {{finally}} block
 *** *This is not a blocking call!*
 *** It does not terminate (or wait for completion) of any {{StreamOpener}} 
instances currently being executed by ThreadPool threads.
 * {{ExceptionStream.open()}} will catch the exception thrown by 
{{CloudSolrStream.open()}} and store it in a Tuple
 * {{getTuple(estream)}} gets the Tuple from {{ExceptionStream.read()}} and 
calls {{ExceptionStream.close()}} calls {{CloudStream.close()}}
 * At this point {{getTuple(estream)}} returns the (expected) Tuple from 
{{ExceptionStream}} and the test can make assertions on them and pass
 ** *BUT MEANWHILE* ...
 ** _after_ the test thread has (cascadingly) called 
{{{}CloudStream.close(){}}}, threads from the {{ExecutorService}} may still be 
calling {{SolrStream.open()}} on the instances in {{CloudStream.solrStreams}}
 *** Due to the way the {{SolrClientCache}} works, These {{SolrStream.open()}} 
calls can/will cause new {{Http2SolrClient}} instances to be created
 **** These {{Http2SolrClient}} instances may be opened _after_ the test thread 
has called {{SolrClientCache.close()}} ... causing them to leak
 *** These {{SolrStream.open()}} calls can/will cause new HTTP Connections to 
be initiated, opening new {{InputStream}} instances
 **** These {{InputStream}} instances will never be closed, because the 
{{SolrStream}} instances will never be "re"-closed.

----
I think at a minimum, {{CloudSolrStream.openStreams()}} should be changed such 
that...
 * It waits for the results of _every_ {{{}Future{}}}, even if one of them 
throws an exception
 ** The {{IOException}} it ultimate throws should wrap one of these caught 
exceptions, and call {{addSuppressed}} on all the rest
 *** This seems like a cleaner/better behavior for the exception handling in 
{{{}CloudSolrStream.openStreams(){}}}, independent of the concurrency race 
conditions being discussed
 *** A key benefit of this, is that it ensures all the {{StreamOpener}} 
instances will have completed before {{CloudSolrStream.openStreams()}} returns 
– ensuring that no background threads can/will muck with the internals of 
{{CloudSolrStream}} while the caller has moved on to calling 
{{CloudSolrStream.close()}}
 * it might make sense to instead use {{shutdownNow() + awaitTermination()}} 
if/when any {{Future}} fails
 ** but that would definitely be more complicated to implement (let alone 
implement correctly)

----
Going above and beyond that, it's worth considering:
 * update many {{Closeable}} classes in solr to keep track of if/when {{close}} 
is called more then once

 * 
 ** ie: a new {{private transient boolean isClosed = false;}} that {{close()}} 
sets to {{true}}
 * In the specific case of streaming classes discussed in this jira...
 ** Change {{SolrClientCache}} so that any method that will add to 
{{solrClients}} throws an {{IllegalStateException}} if {{isClosed}}
 ** make all {{TupleStream}} instances throw an {{IllegalStateException}} from 
{{open()}} and/or {{read()}} if {{isClosed}}
 * In the general case, maybe the {{close()}} method of many solr {{Closeable}} 
classes should...
 ** log a "WARN" if {{isClosed}} is already true
 ** {{assert ! isClosed : "already closed";}} so we trip tests if this happens
 ** (Neither of these are strictly idempotent, but I think the end results 
would be better then perfect idempotent-ness)

 

/cc [~stillalex] 

> Non-reproducible StreamingTest failures -- suggests CloudSolrStream 
> concurency race condition
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16992
>                 URL: https://issues.apache.org/jira/browse/SOLR-16992
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt, 
> thetaphi_solr_Solr-main-Linux_14679.log.txt
>
>
> Roughly 3% of all jenkins jobs that run {{StreamingTest}} wind up having 
> suite level failures.
> These failures have historically taken the form of 
> {{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads 
> all have names like
> {{"h2sc-718-thread-2"}} indicating that they come from the internal 
> {{ExecutorService}} of an {{{}Http2SolrClient{}}}.
> In my experience, the seeds from these failures have never reproduced - 
> suggesting that the problem is related to concurrency.
> SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}} which in 
> theory should help pinpoint where {{Http2SolrClient}} instances might not be 
> getting closed (by causing {{ObjectReleaseTracker}} to fail with stacktraces 
> of when/where any unclosed instances were created - ie: which test method)
> In practice, I have managed to force one failure from {{StreamingTest}} since 
> the SOLR-16983 changes (logs to be attached soon) - but it still didn't 
> indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead 
> indicated was a _single_ unclosed {{InputStream}} instance related to 
> {{Http2SolrClient}} connections (SOLR-16983 also added better tracking of 
> this) coming from {{StreamingTest.testExceptionStream}} - a test method that 
> opens _five_ very similar {{ExceptionStream}} instances, wrapping 
> {{CloudSolrStream}} instance, which expect to trigger server side errors.
> By it's very design, {{ExceptionStream}} catches & records any exceptions 
> from the stream it wraps, so even in the event of these "expected" server 
> side errors, {{ExceptionStream.close()}} should still be correctly getting 
> called (and propagating down to the {{CloudStream}} it wraps).
> I believe the underlying problem has to do with a concurrency race condition 
> between the call to {{CloudStream.close()}} and the {{ExecutorService}} used 
> internally by {{CloudSolrStream.openStreams()}} (details to follow)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to