[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825270#comment-13825270 ]
Germán Blanco commented on ZOOKEEPER-1817:
------------------------------------------

This works in almost all cases.

In what follows, "3.4" is a server running the updated branch 3.4, patch included, and "3.3" is a server running the latest code in branch 3.3.

Rolling upgrade from 3.3 to 3.4 works.
Rolling upgrade from 3.4.5 to 3.4 works, unless there is a leader election at the wrong moment. That is because a 3.4.5 server is not always able to join an ensemble made up of a 3.4.5 server and a 3.4 server. However, some of the elections do finish.

I found two potential causes:

1 - The election epoch reported by a 3.4 follower after an election is -1, instead of the round of the last election. This seems to be because of the change here:
{noformat}
    } else {
        /*
         * If this server is not looking, but the one that sent the ack
         * is looking, then send back what it believes to be the leader.
         */
        Vote current = self.getCurrentVote();
        if(ackstate == QuorumPeer.ServerState.LOOKING){
            if(LOG.isDebugEnabled()){
                LOG.debug("Sending new notification. My id = " +
                        self.getId() + " recipient=" +
                        response.sid + " zxid=0x" +
                        Long.toHexString(current.getZxid()) +
                        " leader=" + current.getId());
            }
            ToSend notmsg = new ToSend(
                    ToSend.mType.notification,
                    current.getId(),
                    current.getZxid(),
+                   current.getElectionEpoch(),
-                   logicalclock,
                    self.getPeerState(),
                    response.sid,
                    current.getPeerEpoch());
            sendqueue.offer(notmsg);
        }
    }
}
{noformat}
I am afraid this change was introduced by me in ZOOKEEPER-1732. Its only purpose was to make it possible to update the election epoch from FLETest. My assumption was that current.getElectionEpoch() was always the same as logicalclock when this function was called. I see now that this is not the case, and it causes problems. I suggest putting this back to what it was (logicalclock) and fixing the test case if required.

2 - The value of n.round is different (because of the "newEpoch-1" issue).
This could be fixed by removing the call to updateElectionVote in Leader.java, and changing the parameter from newEpoch to newEpoch-1 in Learner.java.

I have tried these two changes, and with them the 3.4.5 server joining the 3.4+3.4.5 ensemble seems to finish the election every time.

I can upload logs, but given the number of combinations, sending everything would be a mess. If you are interested in the logs of any of the nodes in any of the rolling upgrade test cases, please let me know and I will send them.

> Fix don't care for b3.4
> -----------------------
>
>                 Key: ZOOKEEPER-1817
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Blocker
>             Fix For: 3.4.6
>
>         Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch,
>                      ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch
>
>
> See umbrella jira.

--
This message was sent by Atlassian JIRA
(v6.1#6144)