[jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request

Hoss Man (JIRA) Wed, 03 Jul 2019 14:42:22 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878180#comment-16878180
 ]


Hoss Man commented on SOLR-13532:
---------------------------------

My first impression on seeing this patch was that I _really_ dislike the idea 
of "fixing" a hardcoded timeout by changing it to a _different_ hardcoded 
timeout – I would really much rather we use the existing {{solr.xml}} 
configured timeouts for this sort of thing.

So then I went poking around the code to refresh my memory about how/where the 
SO & CONNECT timeouts config options for intranode requests get populated in 
the code to propose an alternative patch that uses them, and realized that we 
already have an {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} method that 
returns an HttpClient pre-configured with the correct timeout values ... and 
then I realized that this is already used in the code in question via 
{{withHttpClient(...)}}...
{code:java}
      // existing, pre-patch, code in RecoveryStrategy
      try (HttpSolrClient httpSolrClient = new 
HttpSolrClient.Builder(leaderReplica.getCoreUrl())
          .withSocketTimeout(1000)
          .withConnectionTimeout(1000)
          
.withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
{code}
This {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} concept, and that 
corresponding {{withHttpClient()}} call, was introduced *after* the original 
recovery code was written (with those hardcoed timeouts) ... In theory if we 
just remove the {{withSocketTimeout}} and {{withConnectionTimeout}} completely 
from this class, then the cluster's {{solr.xml}} configuration options should 
start getting used.
----
But then I dug deeper and discovered that the way HttpSolrClient & it's Builder 
works is really silly and frustrating and causes the hardcoded values 
{{SolrClientBuilder.connectionTimeoutMillis = 15000}} and 
{{SolrClientBuilder.socketTimeoutMillis = 120000}} to get used at the request 
level, even when {{withHttpClient}} has been called to set an {{HttpClient}} 
that already has the settings we want ... basically defeating a huge part of 
the value in {{withHttpClient}} ... even using values of {{null}} or {{-1}} 
won't work, because of other nonsensical ways that "default" values come into 
play

I created SOLR-13605 to track the silliness in {{HttpClient.Builder}} – it's a 
bigger issue then just fixing this ping/recovery problem, and will require more 
careful consideration.

As much as it pains me to say this: I think that for now, for the purpose of 
fixing the bug in this jira, we should just remove the {{withSocketTimeout(}} 
and {{withConnectionTimeout()}} calls completely, and defer to the 
(pre-existing) hardcoded defaults in {{SolrClientBuilder}} ... at least that 
way we're reducing the number of hardcoded defaults in the code, and if/when 
SOLR-13605 get's fixed, the {{solr.xml}} settings should take affect.

The other alternative to this would be to update the {{RecoveryStrategy}} code 
to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} and then 
use {{UpdateShardHandlerConfig.getDistributedSocketTimeout()}} and 
{{UpdateShardHandlerConfig.getDistributedConnectionTimeout()}} to pass as the 
inputs to {{SolrHttpClient.Builder}} ... that seemed really silly and redundent 
when it first occured to me, but the more i think about it the more it's 
probably not that bad as a work around for SOLR-13605 until it's fixed.

What do folks think?

> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered following issue with the core recovery:
>  * Core recovery is not being initialized and throwing following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr 
> x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 
> r:core_node2778) x:<collection_name>_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr 
> on recovery, try again{code}
>  * Above error occurs when ping request takes time more than a timeout period 
> which is hard-coded to one second in solr source code. However In a general 
> production setting it is common to have ping time more than one second, 
> hence, the core recovery never starts and exception is thrown.
>  * Also the other major concern is that this exception is logged as an info 
> message, hence it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>       try (HttpSolrClient httpSolrClient = new 
> HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>           .withSocketTimeout(1000)
>           .withConnectionTimeout(1000)
>           
> .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>           .build()) {
>         SolrPingResponse resp = httpSolrClient.ping();
>         return leaderReplica;
>       } catch (IOException e) {
>         log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>         Thread.sleep(500);
>       } catch (Exception e) {
>         if (e.getCause() instanceof IOException) {
>           log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>           Thread.sleep(500);
>         } else {
>           return leaderReplica;
>         }
>       }
> {code}
> The above issue will have high impact in production level clusters, since 
> cores not being able to recover may lead to data loss.
> Following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable and the defaults set to 
> high values like 15seconds.
>  2. The exception message in [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged as *error* messages instead of 
> *info* messages



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request

Reply via email to