[ 
https://issues.apache.org/jira/browse/SOLR-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506622#comment-17506622
 ] 

Chris M. Hostetter commented on SOLR-16099:
-------------------------------------------

I'm filing this issue for tracking purposes and to raise visibility of the 
symptoms and workaround - I don't know if/how it can be "fixed" in Solr.

This popped up for me when dealing with a customer who noticed they were 
running out of disk on nodes hosting PULL replicas of a large collection that 
was getting lots of document updates. The hung threads were holding 
{{SolrIndexSearcher}} references, which had open file handles to segments that 
were no longer part of the "current" index but couldn't be reclaimed by the 
filesystem because of those open file handles.

This problem could manifest in other ways for other people – the most likely 
(beyond the obvious failed or "slow" request metrics) being high memory usage, 
as these threads and the objects they hold references to take up heap space 
indefinitely.

 
----
 

While investigating open jetty issues that might pertain to this, I discovered 
that Ishan & Noble have apparently [been aware of this 
problem|https://github.com/fullstorydev/lucene-solr/pull/121] [for a 
while|https://github.com/fullstorydev/lucene-solr/pull/106], and already have 
[an open ticket with the Jetty project 
(#7192)|https://github.com/eclipse/jetty.project/issues/7192] regarding these 
symptoms.

(Ishan alluded to this problem in SOLR-15840, but that Jira is a general "Task" 
regarding questions about the performance of HTTP2 in Solr and doesn't 
explicitly call out that this "Bug" exists and can hang threads - hence I've 
opened this new Jira.)

So far the only known [steps to reliably 
reproduce|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-989727192]
 are fairly onerous (they crashed my machine when I attempted to run them), 
which makes it hard to identify a root cause, but the [latest 
analysis|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-991068558]
 from Jetty dev {{sbordet}} is...
{quote}We were able to get a failure, so we instrumented the client to 
understand why it was waiting, and found that it was waiting for a request 
whose headers arrived fine (with a 200 status code) but the client was waiting 
for the response body (no errors).

We tried to track this request on the server logs, but we could not find any 
information.
Any attempt to increment the log level to DEBUG makes the failure 
non-reproducible, so we are a bit stuck trying to understand what's going on.

From our point of view, the client received response headers and it's waiting 
for the content, but it never arrives.
We cannot explain how the response headers arrived, but not the content, but we 
don't have information about the server – whether there was an error on the 
server or similar, because the logs don't show any error.

At this point I'm throwing the ball in your court: are you sure that you are 
sending the response content?

...

If we can track the tcpdump and see that the HTTP/2 frames reach the client, 
then it's a Jetty client problem;
otherwise the server did not send the HTTP/2 frames and this could be either 
for a Jetty server problem, or because the application did not send any data.
In this way we're trying to restrict the problem.
{quote}
The issue seems to have stalled out ~2 months ago over attempts to provide a 
usable {{tcpdump}} trace of the affected requests.

[~ichattopadhyaya] / [~noblepaul]: Do you have any more info on this that you 
can share? If you can reliably reproduce this, can you update the jetty ticket 
w/ usable {{tcpdump}} logs?

 
----
 

One thing we may want to consider (in Solr) is replacing our usage of 
{{InputStreamResponseListener}} with a variant implementation that uses a 
"timeout" instead of an unlimited {{wait()}} (along the lines of a [spin-off 
jetty enhancement issue|https://github.com/eclipse/jetty.project/issues/7259] 
one of the jetty devs filed). We could probably (with some effort) tweak the 
impacted Solr APIs to propagate the (remaining) {{timeAllowed}} (if that option 
was specified) down to this class – and/or use an "extreme" default (e.g. 
30 minutes) just to prevent threads from sticking around forever.
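
To make the idea concrete, here's a minimal sketch of the bounded-wait pattern. This is _not_ Jetty's class – all names here are made up for illustration – it just shows the unlimited {{wait()}} replaced with a deadline so the request thread eventually gives up:
{noformat}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative stand-in (NOT Jetty's InputStreamResponseListener): a buffer
 * where the consumer waits on a monitor lock for response content, but with
 * a deadline instead of an unlimited wait().
 */
class BoundedWaitBuffer {
  private final Object lock = new Object();
  private final Queue<byte[]> chunks = new ArrayDeque<>();

  /** Called by the (internal client) "wire" thread when a chunk of content arrives. */
  void offer(byte[] chunk) {
    synchronized (lock) {
      chunks.add(chunk);
      lock.notifyAll();
    }
  }

  /** Called by the request thread; throws instead of hanging forever if nothing arrives. */
  byte[] take(long timeoutMillis) throws IOException, InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    synchronized (lock) {
      while (chunks.isEmpty()) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
        if (remainingMs <= 0) {
          throw new IOException("Timed out after " + timeoutMillis
              + "ms waiting for response content");
        }
        lock.wait(remainingMs); // bounded, unlike the unlimited wait() in the hung threads
      }
      return chunks.poll();
    }
  }
}
{noformat}
In Solr, the timeout passed to something like this could be the (remaining) {{timeAllowed}} when that option was specified, falling back to the "extreme" default otherwise.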

> HTTP Client threads can hang in Jetty's InputStreamResponseListener when 
> using HTTP2 - impacts intra-node communication
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16099
>                 URL: https://issues.apache.org/jira/browse/SOLR-16099
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud, SolrJ
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> There appears to be a Jetty HttpClient bug that makes it possible for a 
> request thread to hang indefinitely while waiting to parse the response from 
> remote jetty servers. The thread hangs because it calls {{.wait()}} on a 
> monitor lock that _should_ be notified by another (internal jetty client) 
> thread when a chunk of data is available from the wire – but in some cases 
> this evidently never happens.
> In the case of {{distrib=true}} requests processed by Solr (aggregating 
> multiple per-shard responses from other nodes) this can manifest with stack 
> traces that look like the following (taken from Solr 8.8.2)...
> {noformat}
>  "thread",{
>    "id":14253,
>    "name":"httpShardExecutor-7-thread-13819-...",
>    "state":"WAITING",
>    "lock":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>    "lock-waiting":{
>      "name":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>      "owner":null},
>    "synchronizers-locked":["java.util.concurrent.ThreadPoolExecutor$Worker@1ec1aed0"],
>    "cpuTime":"65.4882ms",
>    "userTime":"60.0000ms",
>    "stackTrace":["java.base@11.0.14/java.lang.Object.wait(Native Method)",
>      "java.base@11.0.14/java.lang.Object.wait(Unknown Source)",
>      "org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)",
>      "org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90)",
>      "org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99)",
>      "org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217)",
>      "org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211)",
>      "org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202)",
>      "org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)",
>      "org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:51)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:696)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:412)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)",
>      "org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:369)",
>      "org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:297)",
>      "org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:371)",
>      "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:132)",
>      "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:41)",
>      "java.base@11.0.14/java.util.concurrent.FutureTask.run(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.FutureTask.run(Unknown Source)",
>      "com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)",
>      "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)",
>      "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$175/0x0000000840243c40.run(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)",
>      "java.base@11.0.14/java.lang.Thread.run(Unknown Source)"]},
> {noformat}
> ...these {{httpShardExecutor}} threads can stay hung indefinitely, tying up 
> system resources (unless they get a spurious {{notify()}} from the JVM). (In 
> one case, it appears to have caused a request to hang for {*}10.3 hours{*}.)
> Anecdotally:
>  * There is some evidence that this problem did _*NOT*_ affect Solr 8.6.3, 
> but does affect later versions
>  ** suggesting the bug didn't exist in Jetty until _after_ 9.4.27.v20200227
>  * Forcing the Jetty HttpClient to use HTTP1.1 transport seems to prevent 
> this problem from happening
>  ** In Solr this can be done by setting the {{"solr.http1"}} system property
>  ** Or by using the {{Http2SolrClient.Builder.useHttp1_1()}} method in client 
> application code (see the sketch after this list)
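> A minimal sketch of the workarounds above, assuming Solr 8.x SolrJ – the URL 
> and collection are placeholders, and the exact {{useHttp1_1}} signature may 
> vary between SolrJ versions:
> {noformat}
> // JVM-wide on a Solr node: add -Dsolr.http1=true to SOLR_OPTS.
> //
> // Per-client, in application code:
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.Http2SolrClient;
>
> public class ForceHttp1Example {
>   public static void main(String[] args) throws Exception {
>     try (Http2SolrClient client =
>              new Http2SolrClient.Builder("http://localhost:8983/solr/techproducts")
>                  .useHttp1_1(true) // force HTTP/1.1 transport instead of HTTP/2
>                  .build()) {
>       System.out.println(
>           client.query(new SolrQuery("*:*")).getResults().getNumFound());
>     }
>   }
> }
> {noformat}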


