Shalin Shekhar Mangar created SOLR-10914:
--------------------------------------------
Summary: RecoveryStrategy's sendPrepRecoveryCmd can get stuck for
5 minutes if leader is unloaded
Key: SOLR-10914
URL: https://issues.apache.org/jira/browse/SOLR-10914
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrCloud
Affects Versions: 6.6, 6.5, 6.4
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Fix For: master (7.0)
tl;dr: a recovering replica can be stuck for 5 minutes in the prep recovery
request if the leader core is unloaded before that request is made.
SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier it
had no connection/read timeout at all), but that fix has caused another problem.
Consider the following sequence:
# A replica starts up (or is newly created) and goes into recovery.
# The replica finds that leader=X.
# The core X is unloaded, but the node that used to host X is still running and
taking requests.
# The replica calls sendPrepRecoveryCmd to X.
At this point, the node that used to host X receives the prep recovery command,
finds that the core X does not exist, and keeps re-checking in a sleep loop until
its own timeout expires. I am not sure why the prep recovery core admin command
needs to keep waiting when the local core does not exist. The default timeout for
this wait is usually longer than 10 seconds.
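
One possible fail-fast behaviour for the wait described above is sketched below. This is not Solr's code: the class PrepRecoveryWaitSketch, the method waitForState, the Outcome enum, and the timeout and poll values are all hypothetical names and numbers chosen for illustration. The only point it makes is that a sleep loop can check for the local core on every pass and return immediately when the core is missing.

```java
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Solr's implementation.
final class PrepRecoveryWaitSketch {

    enum Outcome { STATE_REACHED, CORE_MISSING, TIMED_OUT, INTERRUPTED }

    /**
     * Polls until stateReached is true or the timeout expires, but gives up
     * at once if coreName is not among the cores hosted locally.
     */
    static Outcome waitForState(Set<String> localCores, String coreName,
                                BooleanSupplier stateReached,
                                long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (!localCores.contains(coreName)) {
                return Outcome.CORE_MISSING; // fail fast: nothing to wait for
            }
            if (stateReached.getAsBoolean()) {
                return Outcome.STATE_REACHED;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return Outcome.INTERRUPTED;
            }
        }
    }

    public static void main(String[] args) {
        // Core X has been unloaded, so the set of local cores is empty.
        Outcome outcome = waitForState(Set.of(), "X", () -> false, 60_000, 100);
        System.out.println(outcome); // prints CORE_MISSING
    }
}
```

With a check like this, the caller would get an error response well inside its 10s read timeout instead of waiting out the full server-side timeout.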
On the recovering replica's side, the prep recovery request has a connection/read
timeout of only 10s, so the request always times out and is retried for up to 5
minutes. Only then does the recovery attempt fail, after which it may be restarted
with the right leader URL.
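
The arithmetic behind the 5 minute stall can be sketched with a small simulation. It is a model of the behaviour described above, not the RecoveryStrategy code: the class PrepRecoveryRetrySketch and its simulate method are hypothetical, the 60s server-side wait is an arbitrary stand-in for "longer than 10 seconds", and the model assumes a retry happens only after a read timeout.

```java
import java.time.Duration;

// Hypothetical model of the retry behaviour, not Solr's implementation.
final class PrepRecoveryRetrySketch {

    /** Number of attempts made and total time spent in the retry loop. */
    static final class Result {
        final int attempts;
        final Duration elapsed;

        Result(int attempts, Duration elapsed) {
            this.attempts = attempts;
            this.elapsed = elapsed;
        }
    }

    /**
     * Simulates a client that retries on read timeout until retryWindow has
     * passed, against a server that takes serverWait to answer each request.
     */
    static Result simulate(Duration readTimeout, Duration serverWait, Duration retryWindow) {
        Duration elapsed = Duration.ZERO;
        int attempts = 0;
        while (elapsed.compareTo(retryWindow) < 0) {
            attempts++;
            if (serverWait.compareTo(readTimeout) < 0) {
                // The server answers before the read timeout: no retry needed.
                return new Result(attempts, elapsed.plus(serverWait));
            }
            // The read timeout fires first; the whole timeout is spent.
            elapsed = elapsed.plus(readTimeout);
        }
        return new Result(attempts, elapsed);
    }

    public static void main(String[] args) {
        Result stuck = simulate(Duration.ofSeconds(10), Duration.ofSeconds(60), Duration.ofMinutes(5));
        // prints "30 attempts, 300 s"
        System.out.println(stuck.attempts + " attempts, " + stuck.elapsed.getSeconds() + " s");
    }
}
```

Under these assumptions the replica makes 30 attempts of 10s each before giving up, whereas a server that answered immediately (serverWait of zero) would end the loop on the first attempt.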