[ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878732#comment-16878732
 ] 

Lucene/Solr QA commented on SOLR-13532:
---------------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
26s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  3m  5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  3m  5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  3m  5s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 58m 
51s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 70m 30s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13532 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12973620/SOLR-13532.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP 
Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / 5bf6cf2 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/479/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/479/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>         Attachments: SOLR-13532.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered following issue with the core recovery:
>  * Core recovery is not being initialized and throwing following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr 
> x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 
> r:core_node2778) x:<collection_name>_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr 
> on recovery, try again{code}
>  * Above error occurs when ping request takes time more than a timeout period 
> which is hard-coded to one second in solr source code. However In a general 
> production setting it is common to have ping time more than one second, 
> hence, the core recovery never starts and exception is thrown.
>  * Also the other major concern is that this exception is logged as an info 
> message, hence it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>       try (HttpSolrClient httpSolrClient = new 
> HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>           .withSocketTimeout(1000)
>           .withConnectionTimeout(1000)
>           
> .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>           .build()) {
>         SolrPingResponse resp = httpSolrClient.ping();
>         return leaderReplica;
>       } catch (IOException e) {
>         log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>         Thread.sleep(500);
>       } catch (Exception e) {
>         if (e.getCause() instanceof IOException) {
>           log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>           Thread.sleep(500);
>         } else {
>           return leaderReplica;
>         }
>       }
> {code}
> The above issue will have high impact in production level clusters, since 
> cores not being able to recover may lead to data loss.
> Following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable and the defaults set to 
> high values like 15seconds.
>  2. The exception message in [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged as *error* messages instead of 
> *info* messages



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to