[ https://issues.apache.org/jira/browse/ZOOKEEPER-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flavio Junqueira resolved ZOOKEEPER-1548. ----------------------------------------- Resolution: Duplicate > Cluster fails election loop in new and interesting way > ------------------------------------------------------ > > Key: ZOOKEEPER-1548 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1548 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection > Affects Versions: 3.4.3 > Reporter: Alan Horn > Fix For: 3.4.6 > > Attachments: 1-follower, 2-follower, 3-leader > > > Hi, > We have a five node cluster, recently upgraded from 3.3.5 to 3.4.3. Was > running fine for a few weeks after the upgrade, then the following sequence > of events occurred : > 1. All servers stopped responding to 'ruok' at the same time > 2. Our local supervisor process restarted all of them at the same time > (yes, this is bad, we didn't expect it to fail this way :) > 3. The cluster would not serve requests after this. Appeared to be unable to > complete an election. > We tried various things at this point, none of which worked : > * Moved around the restart order of the nodes (e.g. 4 thru 0, instead of 0 > thru 4) > * Reduced number of running nodes from 5 -> 3 to simplify the quorum, by only > starting up 0, 1 & 2, in one test, and 0, 2 & 4 in the other > * Removed the *Epoch files from version-2/ snapshot directory > * Put the same version2/snapshot.xxxxx file on each server in the cluster > * Added the (same on all nodes) last txlog onto each cluster > * Kept only the last snapshot plus txlog unique on each server > * Moved leaderServes=no to leaderServes=yes > * Removed all files and started up with empty data as a control. This worked, > but of course isn't terribly useful :) > Finally, I brought the data up on a single node running in standalone and > this worked (yay!) So at this point we brought the single node back into > service and have kept the other four available to debug why the election is > failing. > We downgraded the four nodes to 3.3.5, and then they completed the election > and started serving as expected. > We did a rolling upgrade to 3.4.3, and everything was fine until we restarted > the leader, whereupon we encountered the same re-election loop as before. > We're a bit out of ideas at this point, so I was hoping someone from this > list might have some useful input. > Output from two followers and a leader during this condition are attached. > Cheers, > Al -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira