[ https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gerlowski reassigned SOLR-13038: -------------------------------------- Assignee: (was: Jason Gerlowski) I hope to revisit this soon, but don't have time to focus on it in the immediate future. So I'm removing myself as the assignee. I still think this is an important issue to fix though, as it's a continuing contributor to test flakiness, as well as production behavior. > Overseer actions fail with NoHttpResponseException following a node restart > --------------------------------------------------------------------------- > > Key: SOLR-13038 > URL: https://issues.apache.org/jira/browse/SOLR-13038 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud > Affects Versions: master (8.0) > Reporter: Jason Gerlowski > Priority: Major > Attachments: SOLR-13038.patch > > > I noticed recently that a lot of overseer operations fail if they're executed > right after a restart of a Solr node. The failure returns a message like > "org.apache.solr.client.solrj.SolrServerException:IOException occured when > talking to server at: https://127.0.0.1:62253/solr". The logs are a bit more > helpful: > {code} > org.apache.solr.client.solrj.SolrServerException: IOException occured when > talking to server at: https://127.0.0.1:62253/solr > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657) > ~[java/:?] > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) > ~[java/:?] > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) > ~[java/:?] > at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) > ~[java/:?] > at > org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172) > ~[java/:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[?:1.8.0_172] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172] > at > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) > ~[metrics-core-3.2.6.jar:3.2.6] > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) > ~[java/:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_172] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_172] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to > respond > at > org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) > ~[httpcore-4.4.10.jar:4.4.10] > at > org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) > ~[httpcore-4.4.10.jar:4.4.10] > at > org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) > ~[httpcore-4.4.10.jar:4.4.10] > at > org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) > ~[httpcore-4.4.10.jar:4.4.10] > at > org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120) > ~[java/:?] > at > org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) > ~[httpclient-4.5.6.jar:4.5.6] > at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > ~[httpclient-4.5.6.jar:4.5.6] > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) > ~[java/:?] > ... 12 more > {code} > After a bit of debugging I was able to confirm the problem: when some > non-overseer node gets restarted, the overseer never notices that its > connections are invalid and will try to reuse them for subsequent requests > that happen right after the restart. > There's a few ways we might be able to tackle this: > * we could look at adding logic to {{SolrHttpRequestRetryHandler}} to retry > when this happens. SHRRH already retries NoHttpResponseException generally, > but has other logic which prevents any retries on collection/core-admin APIs. > Maybe we could elaborate this a bit. > * we could add retry logic to the {{HttpShardHandler}} code that makes these > requests. We could do this across the board, or more selectively for only > the overseer commands that are "retry-able". > * We could tweak how our connection pool is managed so that it evicts these > idle connections more aggressively. It seems like something similar has > already been tried (without success) on SOLR-6944 > Not sure what the right approach is. Seems like intermittent > NoHttpResponseExceptions have been a problem in Solr (and its tests) going > back at least 5 years or so. Several JIRAs suggested adding retries for NHRE > in the past but have been killed since not all APIs are idempotent and other > JIRAs have been concerned with fixing this at the (very broad) SolrClient > level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org