[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959941#comment-13959941 ]
Mark Miller commented on SOLR-5952:
-----------------------------------

I've got ApacheCon coming up next week, so I might be a bit behind on things,
but I want to try and get this addressed pretty soon.

> Recovery race/ error
> --------------------
>
>                 Key: SOLR-5952
>                 URL: https://issues.apache.org/jira/browse/SOLR-5952
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7
>            Reporter: Jessica Cheng
>            Assignee: Mark Miller
>              Labels: leader, recovery, solrcloud, zookeeper
>             Fix For: 4.8, 5.0
>
>         Attachments: recovery-failure-host1-log.txt,
> recovery-failure-host2-log.txt
>
>
> We're seeing some shard recovery errors in our cluster when a zookeeper
> "error event" happened. In this particular case, we had two replicas. The
> events from the logs look roughly like this:
> 18:40:36 follower (host2) disconnected from zk
> 18:40:38 original leader started recovery (there was no log about why it
> needed recovery though) and failed because cluster state still says it's the
> leader
> 18:40:39 follower successfully connected to zk after some trouble
> 19:03:35 follower register core/replica
> 19:16:36 follower registration fails due to no leader (NoNode for
> /collections/test-1/leaders/shard2)
> Essentially, I think the problem is that the isLeader property on the cluster
> state is never cleaned up, so neither replica is able to recover/register
> in order to participate in leader election again.
> From the code, it looks like the only place where the isLeader property is
> cleared from the cluster state is ElectionContext.runLeaderProcess, which
> assumes that the replica with the next election seqId will notice the
> leader's node disappearing and run the leader process. This assumption fails
> in this scenario because the follower experienced the same zookeeper "error
> event" and never handled the event of the leader going away. (Mark, this is
> where I was saying in SOLR-3582 that maybe the watcher in
> LeaderElector.checkIfIamLeader does need to handle "Expired" by somehow
> realizing that the leader is gone and clearing the isLeader state at least,
> but it currently ignores all EventType.None events.)
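To make the suggestion in the quoted report concrete, here is a minimal, self-contained Java sketch of the two behaviours it contrasts: a watcher that ignores every session-level event, and one that clears the stale leader flag when the session is reported as expired. It is a toy model only and does not use the ZooKeeper or Solr APIs. `LeaderWatchSketch`, `EventKind`, `SessionState`, `onWatchEvent` and `clearOnExpiry` are hypothetical names invented for illustration, and the in-memory map merely stands in for the isLeader property in the cluster state.

```java
import java.util.HashMap;
import java.util.Map;

final class LeaderWatchSketch {
    // Hypothetical stand-ins for a watcher's event type and session state.
    enum EventKind { NONE, NODE_DELETED }
    enum SessionState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    // Toy cluster state: shard name -> replica currently flagged as leader.
    private final Map<String, String> leaderByShard = new HashMap<>();
    // When false, session-level events are ignored, as the report describes.
    private final boolean clearOnExpiry;

    LeaderWatchSketch(boolean clearOnExpiry) {
        this.clearOnExpiry = clearOnExpiry;
    }

    void markLeader(String shard, String replica) {
        leaderByShard.put(shard, replica);
    }

    String leaderOf(String shard) {
        return leaderByShard.get(shard); // null when no leader is flagged
    }

    // Models the follower's watch on the leader's election node.
    // Returns true if a stale leader flag was cleared by this event.
    boolean onWatchEvent(String shard, EventKind kind, SessionState state) {
        if (kind == EventKind.NONE) {
            // Session-level event: only acted on in the "clear on expiry" variant.
            if (clearOnExpiry && state == SessionState.EXPIRED) {
                return leaderByShard.remove(shard) != null;
            }
            return false;
        }
        if (kind == EventKind.NODE_DELETED) {
            // The leader's node went away: clear the flag so election can proceed.
            return leaderByShard.remove(shard) != null;
        }
        return false;
    }

    public static void main(String[] args) {
        LeaderWatchSketch ignoring = new LeaderWatchSketch(false);
        ignoring.markLeader("shard2", "host1");
        ignoring.onWatchEvent("shard2", EventKind.NONE, SessionState.EXPIRED);
        System.out.println("ignoring session events: leader = " + ignoring.leaderOf("shard2"));

        LeaderWatchSketch clearing = new LeaderWatchSketch(true);
        clearing.markLeader("shard2", "host1");
        clearing.onWatchEvent("shard2", EventKind.NONE, SessionState.EXPIRED);
        System.out.println("clearing on expiry: leader = " + clearing.leaderOf("shard2"));
    }
}
```

In the first variant the flag for shard2 keeps pointing at host1 after the expiry event, which mirrors the stuck state in the report; in the second it is removed. Whether clearing the flag on expiry is safe in a real cluster is a separate question that this sketch does not address.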