[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira resolved ZOOKEEPER-1548.
-----------------------------------------

    Resolution: Duplicate
    
> Cluster fails election loop in new and interesting way
> ------------------------------------------------------
>
>                 Key: ZOOKEEPER-1548
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1548
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.3
>            Reporter: Alan Horn
>             Fix For: 3.4.6
>
>         Attachments: 1-follower, 2-follower, 3-leader
>
>
> Hi,
> We have a five-node cluster, recently upgraded from 3.3.5 to 3.4.3. It ran 
> fine for a few weeks after the upgrade, then the following sequence 
> of events occurred:
> 1. All servers stopped responding to 'ruok' at the same time
> 2. Our local supervisor process restarted all of them at the same time 
> (yes, this is bad, we didn't expect it to fail this way :)
> 3. The cluster would not serve requests after this. It appeared to be unable 
> to complete an election.
> We tried various things at this point, none of which worked:
> * Moved around the restart order of the nodes (e.g. 4 thru 0, instead of 0 
> thru 4)
> * Reduced the number of running nodes from 5 to 3 to simplify the quorum, 
> starting only 0, 1 & 2 in one test and 0, 2 & 4 in the other
> * Removed the *Epoch files from the version-2/ snapshot directory
> * Put the same version-2/snapshot.xxxxx file on each server in the cluster
> * Added the last txlog (same on all nodes) onto each server
> * Kept only each server's own last snapshot plus txlog
> * Changed leaderServes=no to leaderServes=yes
> * Removed all files and started up with empty data as a control. This worked, 
> but of course isn't terribly useful :)
> Finally, I brought the data up on a single node running in standalone mode, 
> and this worked (yay!). So at this point we brought the single node back into 
> service and have kept the other four available to debug why the election is 
> failing.
> We downgraded the four nodes to 3.3.5, and then they completed the election 
> and started serving as expected.
> We did a rolling upgrade to 3.4.3, and everything was fine until we restarted 
> the leader, whereupon we encountered the same re-election loop as before.
> We're a bit out of ideas at this point, so I was hoping someone from this 
> list might have some useful input.
> Output from two followers and a leader during this condition is attached.
> Cheers,
> Al
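
For readers unfamiliar with the 'ruok' check mentioned in step 1 of the report: it is one of ZooKeeper's four-letter-word commands, sent as plain text over the client port, and a healthy server answers 'imok'. The following is a minimal sketch of such a probe, not the reporter's supervisor. The function names, the default port 2181, and the timeout are assumptions made for illustration, and newer ZooKeeper releases may require the command to be whitelisted in the server configuration.

```python
import socket


def zk_ruok(host, port=2181, timeout=2.0):
    """Send 'ruok' to a ZooKeeper server; return True only if it replies 'imok'.

    Any connection error or timeout is treated as "not OK".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The expected reply is exactly four bytes; stop early on EOF.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
            return reply == b"imok"
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def unresponsive(servers, timeout=2.0):
    """Return the (host, port) pairs that fail the 'ruok' probe."""
    return [(h, p) for h, p in servers if not zk_ruok(h, p, timeout)]
```

A supervisor could call `unresponsive([("zk0.example.com", 2181), ("zk1.example.com", 2181)])` (hypothetical host names) and restart the failed servers one at a time, avoiding the simultaneous restart described in step 2.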

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
