[ 
https://issues.apache.org/jira/browse/SOLR-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated SOLR-5593:
--------------------------------------

    Attachment: CoreAdminHandler.patch

Attaching one potential solution (we are investigating others):

The recovery process already publishes state=recovering (RecoveryStrategy 
doRecovery), but only after a shard leader to recover from has been found. If 
the earlier publish in CoreAdminHandler handleRequestRecoveryAction had not 
happened, then one of the followers should have been elected shard leader.
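
To illustrate the reasoning, here is a simplified sketch of the election 
outcome described in the issue. It is not Solr code: the class, the method 
names, the replica names and the seq values are all hypothetical, and the 
two conditions are only modelled on the checks named in the issue 
(shouldIBeLeader and seq <= intSeqs.get(0)).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified model of the election outcome described in SOLR-5593.
// Not Solr code: all class, method and replica names here are hypothetical.
class LeaderElectionSketch {

    enum State { ACTIVE, RECOVERING }

    static final class Replica {
        final String name;
        final int seq;      // position in the election queue (lower = earlier)
        final State state;  // state the replica has published

        Replica(String name, int seq, State state) {
            this.name = name;
            this.seq = seq;
            this.state = state;
        }
    }

    // A replica becomes leader only if it is at the head of the queue
    // (modelled on the seq <= intSeqs.get(0) check) and has not published
    // itself as recovering (modelled on shouldIBeLeader).
    static Optional<String> electLeader(List<Replica> replicas) {
        int lowestSeq = Integer.MAX_VALUE;
        for (Replica r : replicas) {
            lowestSeq = Math.min(lowestSeq, r.seq);
        }
        for (Replica r : replicas) {
            if (r.seq <= lowestSeq && r.state != State.RECOVERING) {
                return Optional.of(r.name);
            }
        }
        return Optional.empty();
    }

    // Reported case: followers published recovering before the election,
    // and the former leader rejoined with a higher seq number.
    static List<Replica> reportedScenario() {
        return Arrays.asList(
            new Replica("follower1", 1, State.RECOVERING),
            new Replica("follower2", 2, State.RECOVERING),
            new Replica("formerLeader", 3, State.ACTIVE));
    }

    // Same queue, but the followers have not published recovering early.
    static List<Replica> withoutEarlyPublish() {
        return Arrays.asList(
            new Replica("follower1", 1, State.ACTIVE),
            new Replica("follower2", 2, State.ACTIVE),
            new Replica("formerLeader", 3, State.ACTIVE));
    }

    public static void main(String[] args) {
        System.out.println("reported: " + electLeader(reportedScenario()));
        System.out.println("without early publish: "
            + electLeader(withoutEarlyPublish()));
    }
}
```

In this model the reported scenario yields no leader, whereas the same queue 
without the early recovering publish elects the first follower. The model 
covers a single election round only and ignores retries and timeouts.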


> shard leader loss due to ZK session expiry
> ------------------------------------------
>
>                 Key: SOLR-5593
>                 URL: https://issues.apache.org/jira/browse/SOLR-5593
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Christine Poerschke
>         Attachments: CoreAdminHandler.patch
>
>
> The problem we saw was that the shard leader ceased to be shard leader (in 
> our case due to its zookeeper session expiring). The followers thus rejected 
> update requests (DistributedUpdateProcessor setupRequest's call to 
> ZkStateReader getLeaderRetry) and the leader asked them to recover 
> (DistributedUpdateProcessor doFinish). The followers published themselves as 
> recovering (CoreAdminHandler handleRequestRecoveryAction) and the shard 
> leader loss triggered an election in which none of the followers became the 
> leader due to their recovering state (ShardLeaderElectionContext 
> shouldIBeLeader). The former shard leader did not become shard leader 
> either, because its new seq number placed it after the existing replicas 
> in the election queue (LeaderElector checkIfIamLeader requires 
> seq <= intSeqs.get(0)).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
