[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825270#comment-13825270
 ] 

Germán Blanco commented on ZOOKEEPER-1817:
------------------------------------------

This works in almost all cases.
Here "3.4" means a server running the updated branch 3.4, patch included, and 
"3.3" means a server running the latest code in branch 3.3.
Rolling upgrade from 3.3 to 3.4 works.
Rolling upgrade from 3.4.5 to 3.4 works, unless a leader election happens at 
the wrong moment.
That is because a 3.4.5 server is not always able to join an ensemble made up 
of a 3.4.5 server and a 3.4 server. Some of the elections do finish, however. 
I found two potential causes:
1 - The election epoch reported by a 3.4 follower after an election is -1, 
instead of the round of the last election. This seems to be caused by the 
change here:
{noformat}
                            } else {
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }
                                    ToSend notmsg = new ToSend(
                                            ToSend.mType.notification,
                                            current.getId(),
                                            current.getZxid(),
+                                           current.getElectionEpoch(),
-                                           logicalclock,
                                            self.getPeerState(),
                                            response.sid,
                                            current.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
{noformat}
I am afraid I introduced this change myself in ZOOKEEPER-1732. Its only 
purpose was to make it possible to update the election epoch from FLETest. 
My assumption was that current.getElectionEpoch() was always equal to 
logicalclock when this function was called. I see now that this is not the 
case, and it causes problems. I suggest putting this back to what it was 
(logicalclock) and fixing the test case if required.
2 - The value of n.round is different (because of the "newEpoch-1" issue). This 
could be fixed by removing the call to updateElectionVote in Leader.java, and 
changing the parameter from newEpoch to newEpoch-1 in Learner.java.
I have tried these two changes, and with them the election seems to finish 
every time for the 3.4.5 server joining the 3.4+3.4.5 ensemble.
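
To illustrate cause 1, here is a minimal sketch in Java. It is not ZooKeeper 
code: SimpleVote, withoutElectionEpoch, epochFromVote and epochFromClock are 
hypothetical names, and the sketch assumes that a stored vote can be created 
without an explicit election epoch, in which case the field holds -1. Under 
that assumption, reading the epoch from the stored vote and reading it from 
the logical clock give different values for the same notification.

```java
public class ElectionEpochSketch {

    // Hypothetical stand-in for a stored vote; not the real ZooKeeper Vote class.
    static final class SimpleVote {
        final long leaderId;
        final long zxid;
        final long electionEpoch;
        final long peerEpoch;

        SimpleVote(long leaderId, long zxid, long electionEpoch, long peerEpoch) {
            this.leaderId = leaderId;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        // Assumption: a vote stored without an explicit election epoch carries -1.
        static SimpleVote withoutElectionEpoch(long leaderId, long zxid, long peerEpoch) {
            return new SimpleVote(leaderId, zxid, -1L, peerEpoch);
        }
    }

    // Election epoch put in the notification when it is read from the stored vote.
    static long epochFromVote(SimpleVote current, long logicalclock) {
        return current.electionEpoch;
    }

    // Election epoch put in the notification when it is read from the logical clock.
    static long epochFromClock(SimpleVote current, long logicalclock) {
        return logicalclock;
    }

    public static void main(String[] args) {
        long logicalclock = 3L; // round of the last election (made-up value)
        SimpleVote current = SimpleVote.withoutElectionEpoch(1L, 0x100000000L, 1L);

        System.out.println("from vote:  " + epochFromVote(current, logicalclock));
        System.out.println("from clock: " + epochFromClock(current, logicalclock));
    }
}
```

The two helpers take the same arguments on purpose, so the only difference 
between them is the source of the election epoch, which mirrors the one-line 
difference in the diff above.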
I can upload logs, but given the amount of combinations, sending everything 
would be a mess. If you are interested in the logs of any of the nodes in any 
of the rolling upgrade test cases, please let me know and I will send them.

> Fix don't care for b3.4
> -----------------------
>
>                 Key: ZOOKEEPER-1817
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Blocker
>             Fix For: 3.4.6
>
>         Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, 
> ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch
>
>
> See umbrella jira.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
