[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859330#comment-16859330 ]
Gus Heck commented on SOLR-13532:
---------------------------------

Hard-coded timeouts are a general problem in several areas. I opened SOLR-13457 a little while ago to track these sorts of problems.

> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>
> Discovered the following issue with core recovery:
> * Core recovery is not being initialized and throws the following exception message:
> {code:java}
> 2019-06-07 00:53:12.436 INFO (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again{code}
> * The above error occurs when the ping request takes longer than the timeout, which is hard-coded to one second in the Solr source code. However, in a typical production setting it is common for a ping to take more than one second; hence core recovery never starts and the exception is thrown.
> * The other major concern is that this exception is logged as an info message, so it is very difficult to identify the error if info logging is not enabled.
> * Please refer to the following code snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to understand the above issue.
> {code:java}
> try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>     .withSocketTimeout(1000)
>     .withConnectionTimeout(1000)
>     .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>     .build()) {
>   SolrPingResponse resp = httpSolrClient.ping();
>   return leaderReplica;
> } catch (IOException e) {
>   log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>   Thread.sleep(500);
> } catch (Exception e) {
>   if (e.getCause() instanceof IOException) {
>     log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>     Thread.sleep(500);
>   } else {
>     return leaderReplica;
>   }
> }
> {code}
> The above issue will have a high impact on production clusters, since cores that cannot recover may lead to data loss.
> The following improvements would be really helpful:
> 1. The [timeout for ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] in *RecoveryStrategy.java* should be configurable, with the default set to a higher value such as 15 seconds.
> 2. The exception message in [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] in *RecoveryStrategy.java* should be logged as *error* messages instead of *info* messages.
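
For illustration only, the first improvement requested in the issue (a configurable ping timeout with a 15-second default) could look roughly like the sketch below. It is not the actual Solr change: the class name `RecoveryPingTimeout` and the property name `solr.recovery.pingTimeoutMs` are hypothetical, and the fallback behaviour for missing or invalid values is an assumption.

```java
import java.util.Properties;

/** Hypothetical helper: resolves the ping timeout from configuration instead of hard-coding it. */
final class RecoveryPingTimeout {
    // Hypothetical property name; not an existing Solr setting.
    static final String PROPERTY = "solr.recovery.pingTimeoutMs";
    // 15 seconds, the default value proposed in the issue.
    static final int DEFAULT_MILLIS = 15_000;

    private RecoveryPingTimeout() {}

    /** Returns the configured timeout in milliseconds, or the default if unset, non-numeric, or not positive. */
    static int resolve(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MILLIS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MILLIS;
        } catch (NumberFormatException e) {
            return DEFAULT_MILLIS;
        }
    }
}
```

Under these assumptions, the value returned by `RecoveryPingTimeout.resolve(System.getProperties())` would be passed to `withSocketTimeout(...)` and `withConnectionTimeout(...)` in place of the literal `1000` in the quoted snippet.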