[ https://issues.apache.org/jira/browse/SOLR-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christine Poerschke updated SOLR-5593:
--------------------------------------
    Attachment: CoreAdminHandler.patch

Attaching one potential solution (we are investigating others):

As part of the recovery process state=recovering publishing already happens (RecoveryStrategy doRecovery) but only after a shard leader to recover from has been found. If the CoreAdminHandler handleRequestRecoveryAction publish had not happened then one of the followers should have been elected shard leader.

> shard leader loss due to ZK session expiry
> ------------------------------------------
>
>                 Key: SOLR-5593
>                 URL: https://issues.apache.org/jira/browse/SOLR-5593
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Christine Poerschke
>         Attachments: CoreAdminHandler.patch
>
>
> The problem we saw was that the shard leader ceased to be shard leader (in
> our case due to its zookeeper session expiring). The followers thus rejected
> update requests (DistributedUpdateProcessor setupRequest's call to
> ZkStateReader getLeaderRetry) and the leader asked them to recover
> (DistributedUpdateProcessor doFinish). The followers published themselves as
> recovering (CoreAdminHandler handleRequestRecoveryAction) and the shard
> leader loss triggered an election in which none of the followers became the
> leader due to their recovering state (ShardLeaderElectionContext
> shouldIBeLeader). The former shard leader also did not become shard leader
> because its new seq number placed it after the existing replicas
> (LeaderElector checkIfIamLeader seq <= intSeqs.get(0)).
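
To make the reported sequence easier to follow, here is a toy model of the election outcome. It is not Solr code: the class ElectionSketch, the Replica type, the State enum and the electLeader method are all hypothetical names invented for this illustration. The sketch assumes that only the replica with the lowest election sequence number gets to attempt leadership (mirroring the seq <= intSeqs.get(0) check mentioned above), that a recovering replica declines (mirroring the shouldIBeLeader check), and that the queue does not change after a candidate declines.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ElectionSketch {
    // Hypothetical replica states; only the two relevant to the report.
    enum State { ACTIVE, RECOVERING }

    // Hypothetical stand-in for one replica's election entry; not a Solr class.
    static final class Replica {
        final String name;
        final int seq;      // election sequence number, lower means earlier in the queue
        final State state;

        Replica(String name, int seq, State state) {
            this.name = name;
            this.seq = seq;
            this.state = state;
        }
    }

    // Returns the elected leader's name, or empty if the head of the queue declines.
    static Optional<String> electLeader(List<Replica> replicas) {
        // Only the replica with the lowest seq may attempt to become leader.
        Optional<Replica> head =
            replicas.stream().min(Comparator.comparingInt((Replica r) -> r.seq));
        // A recovering head declines, and in this toy model nobody else gets a turn.
        return head.filter(r -> r.state == State.ACTIVE).map(r -> r.name);
    }

    public static void main(String[] args) {
        // Reported scenario: followers already published as recovering, and the
        // former leader rejoined with a higher seq after its session expired.
        List<Replica> reported = Arrays.asList(
            new Replica("follower1", 1, State.RECOVERING),
            new Replica("follower2", 2, State.RECOVERING),
            new Replica("formerLeader", 3, State.ACTIVE));
        System.out.println(electLeader(reported));     // Optional.empty

        // Same queue, but without the early state=recovering publish.
        List<Replica> withoutPublish = Arrays.asList(
            new Replica("follower1", 1, State.ACTIVE),
            new Replica("follower2", 2, State.ACTIVE),
            new Replica("formerLeader", 3, State.ACTIVE));
        System.out.println(electLeader(withoutPublish)); // Optional[follower1]
    }
}
```

Under these assumptions the first call finds no leader, which matches the reported symptom, and the second call elects a follower, which is the outcome the comment above expects if the handleRequestRecoveryAction publish does not happen.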