[ https://issues.apache.org/jira/browse/SOLR-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-16992: -------------------------------------- Attachment: OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt thetaphi_solr_Solr-main-Linux_14679.log.txt Status: Open (was: Open) I'm attaching two files: a typical jenkins job failure of {{StreamingTest}} ({_}before{_} SOLR-16983) and the non-reproducable failure i was able to locally coax out of branch_9x ({_}after{_} SOLR-16983) Here's the important snippet from the output of the local failure... {noformat} > java.lang.AssertionError: ObjectTracker found 1 object(s) that were not released!!! [Input] > org.eclipse.jetty.client.util.InputStreamResponseListener$Input:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: org.eclipse.jetty.client.u til.InputStreamResponseListener$Input > at org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:52) > at org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener$ObjectReleaseTrackedInputStream.<init>(Http2SolrClient.java: 1495) > at org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener.getInputStream(Http2SolrClient.java:1489) > at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:532) > at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1192) > at org.apache.solr.client.solrj.io.stream.SolrStream.constructParser(SolrStream.java:304) > at org.apache.solr.client.solrj.io.stream.SolrStream.open(SolrStream.java:129) > at org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:539) > at org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:527) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.Exception: Submitter stack trace > at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256) > at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140) > at org.apache.solr.client.solrj.io.stream.CloudSolrStream.lambda$openStreams$8(CloudSolrStream.java:419) > at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) > at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:420) > at org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:292) > at org.apache.solr.client.solrj.io.stream.ExceptionStream.open(ExceptionStream.java:52) > at org.apache.solr.client.solrj.io.stream.StreamingTest.getTuple(StreamingTest.java:2895) > at org.apache.solr.client.solrj.io.stream.StreamingTest.testExceptionStream(StreamingTest.java:662) {noformat} Here is roughly what i think is happening... * {{testExceptionStream()}} calls {{getTuple(estream)}} calls {{ExceptionStream.open()}} calls {{CloudSolrStream.open()}} * {{CloudSolrStream.open()}} initializes {{List<TupleStream> solrStreams}} containing one {{SolrStream}} per shard and calls {{CloudSolrStream.openStreams()}} * {{CloudSolrStream.openStreams()}} ... ** spins up an {{ExecutorService servce = ExecutorUtil.newMDCAwareCachedThreadPool(new SolrNamedThreadFactory("CloudSolrStream"))}} ** calls {{service.submit(new StreamOpener(...))}} on every instance in {{solrStreams}} and records the {{List<Future<TupleWrapper>> futures}} ** loops over the {{futures}} ... *** If any throw an (Execution) exception: immediately wrap and rethrow that exception ** call {{servce.shutdown()}} in a {{finally}} block *** *This is not a blocking call!* *** It does not terminate (or wait for completion) of any {{StreamOpener}} instances currently being executed by ThreadPool threads. * {{ExceptionStream.open()}} will catch the exception thrown by {{CloudSolrStream.open()}} and store it in a Tuple * {{getTuple(estream)}} gets the Tuple from {{ExceptionStream.read()}} and calls {{ExceptionStream.close()}} calls {{CloudStream.close()}} * At this point {{getTuple(estream)}} returns the (expected) Tuple from {{ExceptionStream}} and the test can make assertions on them and pass ** *BUT MEANWHILE* ... ** _after_ the test thread has (cascadingly) called {{{}CloudStream.close(){}}}, threads from the {{ExecutorService}} may still be calling {{SolrStream.open()}} on the instances in {{CloudStream.solrStreams}} *** Due to the way the {{SolrClientCache}} works, These {{SolrStream.open()}} calls can/will cause new {{Http2SolrClient}} instances to be created **** These {{Http2SolrClient}} instances may be opened _after_ the test thread has called {{SolrClientCache.close()}} ... causing them to leak *** These {{SolrStream.open()}} calls can/will cause new HTTP Connections to be initiated, opening new {{InputStream}} instances **** These {{InputStream}} instances will never be closed, because the {{SolrStream}} instances will never be "re"-closed. ---- I think at a minimum, {{CloudSolrStream.openStreams()}} should be changed such that... * It waits for the results of _every_ {{{}Future{}}}, even if one of them throws an exception ** The {{IOException}} it ultimate throws should wrap one of these caught exceptions, and call {{addSuppressed}} on all the rest *** This seems like a cleaner/better behavior for the exception handling in {{{}CloudSolrStream.openStreams(){}}}, independent of the concurrency race conditions being discussed *** A key benefit of this, is that it ensures all the {{StreamOpener}} instances will have completed before {{CloudSolrStream.openStreams()}} returns – ensuring that no background threads can/will muck with the internals of {{CloudSolrStream}} while the caller has moved on to calling {{CloudSolrStream.close()}} * it might make sense to instead use {{shutdownNow() + awaitTermination()}} if/when any {{Future}} fails ** but that would definitely be more complicated to implement (let alone implement correctly) ---- Going above and beyond that, it's worth considering: * update many {{Closeable}} classes in solr to keep track of if/when {{close}} is called more then once * ** ie: a new {{private transient boolean isClosed = false;}} that {{close()}} sets to {{true}} * In the specific case of streaming classes discussed in this jira... ** Change {{SolrClientCache}} so that any method that will add to {{solrClients}} throws an {{IllegalStateException}} if {{isClosed}} ** make all {{TupleStream}} instances throw an {{IllegalStateException}} from {{open()}} and/or {{read()}} if {{isClosed}} * In the general case, maybe the {{close()}} method of many solr {{Closeable}} classes should... ** log a "WARN" if {{isClosed}} is already true ** {{assert ! isClosed : "already closed";}} so we trip tests if this happens ** (Neither of these are strictly idempotent, but I think the end results would be better then perfect idempotent-ness) /cc [~stillalex] > Non-reproducible StreamingTest failures -- suggests CloudSolrStream > concurency race condition > --------------------------------------------------------------------------------------------- > > Key: SOLR-16992 > URL: https://issues.apache.org/jira/browse/SOLR-16992 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Chris M. Hostetter > Priority: Major > Attachments: > OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt, > thetaphi_solr_Solr-main-Linux_14679.log.txt > > > Roughly 3% of all jenkins jobs that run {{StreamingTest}} wind up having > suite level failures. > These failures have historically taken the form of > {{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads > all have names like > {{"h2sc-718-thread-2"}} indicating that they come from the internal > {{ExecutorService}} of an {{{}Http2SolrClient{}}}. > In my experience, the seeds from these failures have never reproduced - > suggesting that the problem is related to concurrency. > SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}} which in > theory should help pinpoint where {{Http2SolrClient}} instances might not be > getting closed (by causing {{ObjectReleaseTracker}} to fail with stacktraces > of when/where any unclosed instances were created - ie: which test method) > In practice, I have managed to force one failure from {{StreamingTest}} since > the SOLR-16983 changes (logs to be attached soon) - but it still didn't > indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead > indicated was a _single_ unclosed {{InputStream}} instance related to > {{Http2SolrClient}} connections (SOLR-16983 also added better tracking of > this) coming from {{StreamingTest.testExceptionStream}} - a test method that > opens _five_ very similar {{ExceptionStream}} instances, wrapping > {{CloudSolrStream}} instance, which expect to trigger server side errors. > By it's very design, {{ExceptionStream}} catches & records any exceptions > from the stream it wraps, so even in the event of these "expected" server > side errors, {{ExceptionStream.close()}} should still be correctly getting > called (and propagating down to the {{CloudStream}} it wraps). > I believe the underlying problem has to do with a concurrency race condition > between the call to {{CloudStream.close()}} and the {{ExecutorService}} used > internally by {{CloudSolrStream.openStreams()}} (details to follow) -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org