[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959941#comment-13959941
 ] 

Mark Miller commented on SOLR-5952:
-----------------------------------

I've got ApacheCon coming up next week, so I might be a bit behind on things, 
but I want to try and get this addressed pretty soon.

> Recovery race/ error
> --------------------
>
>                 Key: SOLR-5952
>                 URL: https://issues.apache.org/jira/browse/SOLR-5952
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7
>            Reporter: Jessica Cheng
>            Assignee: Mark Miller
>              Labels: leader, recovery, solrcloud, zookeeper
>             Fix For: 4.8, 5.0
>
>         Attachments: recovery-failure-host1-log.txt, 
> recovery-failure-host2-log.txt
>
>
> We're seeing some shard recovery errors in our cluster when a zookeeper 
> "error event" happened. In this particular case, we had two replicas. The 
> events from the logs look roughly like this:
> 18:40:36 follower (host2) disconnected from zk
> 18:40:38 original leader started recovery (there was no log about why it 
> needed recovery though) and failed because cluster state still says it's the 
> leader
> 18:40:39 follower successfully connected to zk after some trouble
> 19:03:35 follower registers core/replica
> 19:16:36 follower registration fails due to no leader (NoNode for 
> /collections/test-1/leaders/shard2)
> Essentially, I think the problem is that the isLeader property on the cluster 
> state is never cleaned up, so neither replica is able to recover/register 
> in order to participate in leader election again.
> From the code, it looks like the only place where the isLeader property is 
> cleared from the cluster state is ElectionContext.runLeaderProcess, 
> which assumes that the replica with the next election seqId will notice the 
> leader's node disappearing and run the leader process. This assumption fails 
> in this scenario because the follower experienced the same zookeeper "error 
> event" and never handled the event of the leader going away. (Mark, this is 
> where I was saying in SOLR-3582 that maybe the watcher in 
> LeaderElector.checkIfIamLeader does need to handle "Expired" by somehow 
> realizing that the leader is gone and clearing the isLeader state at least, 
> but it currently ignores all EventType.None events.)
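
For illustration only, here is a minimal, self-contained sketch of the failure mode described in the quoted report. It is not Solr code and does not use the ZooKeeper client: the enums are hypothetical stand-ins that borrow the names of ZooKeeper's EventType and KeeperState, ClusterState is a toy map, and clearing the flag directly inside the watcher is a simplification (the report says the real cleanup lives in ElectionContext.runLeaderProcess). The first watcher drops every EventType.None event, as the report says the current watcher does; the second is one possible reading of the suggestion to react to "Expired".

```java
import java.util.HashMap;
import java.util.Map;

public class LeaderWatchSketch {
    // Hypothetical stand-ins; they only mirror the names of ZooKeeper's enums.
    enum EventType { None, NodeDeleted }
    enum KeeperState { SyncConnected, Disconnected, Expired }

    // Toy cluster state (assumption): replica name -> isLeader flag.
    static final class ClusterState {
        final Map<String, Boolean> isLeader = new HashMap<>();

        boolean hasLeader() {
            return isLeader.containsValue(Boolean.TRUE);
        }

        void clearLeader() {
            isLeader.replaceAll((replica, flag) -> Boolean.FALSE);
        }
    }

    // Behavior described in the report: every EventType.None event is ignored,
    // so a session expiry never leads to the isLeader flag being cleared.
    static void watcherIgnoringNone(EventType type, KeeperState state, ClusterState cs) {
        if (type == EventType.None) {
            return;
        }
        cs.clearLeader(); // NodeDeleted: the leader's node went away
    }

    // Hypothetical variant: treat an expired session as "the leader may be gone"
    // and clear the flag; other connection-state events are still ignored.
    static void watcherHandlingExpired(EventType type, KeeperState state, ClusterState cs) {
        if (type == EventType.None && state != KeeperState.Expired) {
            return;
        }
        cs.clearLeader();
    }

    // Two replicas, as in the report: host1 is the leader, host2 the follower.
    static ClusterState twoReplicas() {
        ClusterState cs = new ClusterState();
        cs.isLeader.put("host1", Boolean.TRUE);
        cs.isLeader.put("host2", Boolean.FALSE);
        return cs;
    }

    public static void main(String[] args) {
        ClusterState stale = twoReplicas();
        watcherIgnoringNone(EventType.None, KeeperState.Expired, stale);
        System.out.println("ignoring None, leader flag still set: " + stale.hasLeader());

        ClusterState cleared = twoReplicas();
        watcherHandlingExpired(EventType.None, KeeperState.Expired, cleared);
        System.out.println("handling Expired, leader flag still set: " + cleared.hasLeader());
    }
}
```

In the first case the stale isLeader flag survives the expiry, which matches the state the report describes, where neither replica can recover or register. Whether clearing the flag on "Expired" is the right fix for Solr is a separate question that this sketch does not answer.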



