[ https://issues.apache.org/jira/browse/SOLR-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072516#comment-16072516 ]
ASF subversion and git services commented on SOLR-10914:
--------------------------------------------------------

Commit df727d313f6f63f73b8efe0a0448b263581670bd in lucene-solr's branch refs/heads/branch_6x from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=df727d3 ]

SOLR-10914: RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded

(cherry picked from commit 157ff9a)

> RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader
> is unloaded
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-10914
>                 URL: https://issues.apache.org/jira/browse/SOLR-10914
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 6.4, 6.5, 6.6
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: master (7.0), 6.7
>
>         Attachments: SOLR-10914.patch, SOLR-10914.patch, SOLR-10914.patch
>
>
> tl;dr; a recovering replica is stuck for 5 minutes in the prep recovery
> request if the leader core is unloaded before the prep recovery request is
> made.
> SOLR-9716 changed the sendPrepRecoveryCmd to retry on read timeouts (earlier
> it had no connection/read timeout at all) but the fix has caused another
> problem. Say
> # A replica starts up (or is newly created) and goes into recovery,
> # Replica finds that leader=X
> # The core X is unloaded but the node that used to host X is still running
> and taking requests
> # Replica calls sendPrepRecoveryCmd to X
> At this point, the node X receives the prep recovery command, finds that the
> core X does not exist and keeps checking again in a sleep-loop until a
> timeout happens. I am not sure why prep recovery core admin command needs to
> continue waiting if a local core does not exist. The default timeout here is
> usually longer than 10 seconds.
> On the recovering replica's side, the prep recovery has a connection/read
> timeout of only 10s, so the request always times out and is retried for up to 5
> minutes. Only then does the recovery attempt fail, and it may be restarted again
> with the right leader URL.
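
The timing interaction described in the issue can be illustrated with a small simulation. This is only a sketch and not Solr code: the class `PrepRecoverySketch`, the method `simulate`, and the `failFastWhenCoreMissing` flag are hypothetical names, and the 30 second leader-side wait is an assumed value (the issue only says the default is usually longer than 10 seconds). The 10 second read timeout and the 5 minute retry window are taken from the description. Time is tracked as a counter, so nothing actually sleeps.

```java
/**
 * Illustrative simulation only; none of these names are Solr APIs.
 */
final class PrepRecoverySketch {

    /** Outcome of one simulated prep recovery attempt sequence. */
    static final class Result {
        final boolean leaderAnswered; // true if any request got a response before its read timeout
        final int attempts;           // number of requests the replica sent
        final long elapsedMs;         // simulated time spent by the replica

        Result(boolean leaderAnswered, int attempts, long elapsedMs) {
            this.leaderAnswered = leaderAnswered;
            this.attempts = attempts;
            this.elapsedMs = elapsedMs;
        }
    }

    /**
     * @param readTimeoutMs           replica-side read timeout per request
     * @param retryBudgetMs           total time the replica keeps retrying
     * @param leaderWaitMs            how long the leader node waits for the missing core (assumed value)
     * @param failFastWhenCoreMissing hypothetical behaviour: answer at once if the core does not exist
     */
    static Result simulate(long readTimeoutMs, long retryBudgetMs,
                           long leaderWaitMs, boolean failFastWhenCoreMissing) {
        if (failFastWhenCoreMissing) {
            // The replica learns immediately that core X is gone.
            return new Result(true, 1, 0L);
        }
        int attempts = 0;
        long elapsedMs = 0L;
        while (elapsedMs < retryBudgetMs) {
            attempts++;
            if (leaderWaitMs <= readTimeoutMs) {
                // The leader gives up waiting before the replica's read timeout fires.
                elapsedMs += leaderWaitMs;
                return new Result(true, attempts, elapsedMs);
            }
            // The read timeout fires first, so the replica retries the same request.
            elapsedMs += readTimeoutMs;
        }
        return new Result(false, attempts, elapsedMs);
    }

    public static void main(String[] args) {
        Result stuck = simulate(10_000L, 300_000L, 30_000L, false);
        System.out.println("sleep-loop leader: attempts=" + stuck.attempts
                + " elapsedMs=" + stuck.elapsedMs);
        Result fast = simulate(10_000L, 300_000L, 30_000L, true);
        System.out.println("fail-fast leader:  attempts=" + fast.attempts
                + " elapsedMs=" + fast.elapsedMs);
    }
}
```

Whenever the leader-side wait exceeds the replica's read timeout, every request times out and the replica uses up its whole retry window. The fail-fast branch only models the question raised in the description about why the command waits for a core that does not exist locally; it is not a statement of what the attached patch does.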