[ https://issues.apache.org/jira/browse/ZOOKEEPER-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010278#comment-13010278 ]
Vishal K commented on ZOOKEEPER-1026:
-------------------------------------

Hi Jeremy,

I think I know what's going on here. The bug I described in my earlier comment is causing legitimate transactions to get truncated. You were right about node 3 becoming the leader after reboot. I have given the order of events and my explanation below. Note that the order of events is slightly different from what you described.

1. After your step 6, node 1 becomes the leader (its ID is > node 3's ID). Here's what node 1 is saying in terms of membership and election rounds:

a) 2048:151124 [QuorumPeer:/0.0.0.0:2888] WARN org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
   <==== NODE 2 WENT DOWN HERE (your step #6)

b) 13497: at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:375)
   ===> NODE 3 WENT DOWN HERE (your step #8)

c) 17926:309591 [LearnerHandler-/13.0.0.12:52753] WARN org.apache.zookeeper.server.quorum.LearnerHandler - ******* GOODBYE /13.0.0.12:52753 ********
   ===> NODE 2 WENT DOWN HERE AGAIN (your step #10)

d) 18044 310757 [QuorumPeer:/0.0.0.0:2888] INFO org.apache.zookeeper.server.quorum.Leader - Shutdown called
   ===> NODE 1 *shutdown* again? (This is because it lost its majority, as explained below.)
   18045 at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:390)
   18046 at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:367)
   18047 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:658)

e) 23685:1831 [pool-1-thread-3] INFO org.apache.zookeeper.server.quorum.QuorumPeer - tickTime set to 3000
   ====> NODE 1 restarted again (this is because of your client assert)

2. Notice that the time difference between b) and c) above is only 4.5 seconds, so you rebooted node 2 before node 3 had rejoined the cluster. As a result node 1 lost its majority and gave up leadership. You can see that shutdown is called from Leader.java:367, which happens when the leader cannot ping its followers. Your application would have seen a DISCONNECT before it asserted.

What took node 3 so long to join the cluster? As described in ZOOKEEPER-975 and my previous comment, 3 went into the LEADING state because of the bug in FastLeaderElection. After node 3 reboots, nodes 1 and 2 send it old notifications. Once 3 has received notifications from all nodes, it goes into the LEADING state. Here's the incorrect if condition from FastLeaderElection:

    //If have received from all nodes, then terminate
    if ((self.getVotingView().size() == recvset.size()) &&
            (self.getQuorumVerifier().getWeight(proposedLeader) != 0)) {
        self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING : learningState());
        leaveInstance();
        return new Vote(proposedLeader, proposedZxid);
    }
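To make the failure mode concrete, here is a small, self-contained model. It is purely illustrative: the class, field names, and vote values are invented, and this is neither the ZooKeeper source nor the ZOOKEEPER-975 patch. It only shows why "a notification has arrived from every voter" is not the same as "every voter supports my proposal in the current round", which is how node 3 can move to LEADING on stale notifications:

    import java.util.HashMap;
    import java.util.Map;

    public class StaleNotificationDemo {

        // Minimal stand-in for a received notification: who it votes for
        // and which election round it belongs to.
        static class Vote {
            final long leader;
            final long electionRound;
            Vote(long leader, long electionRound) {
                this.leader = leader;
                this.electionRound = electionRound;
            }
        }

        public static void main(String[] args) {
            final long myRound = 3;         // node 3's current election round (hypothetical value)
            final long proposedLeader = 37; // node 3 proposes itself
            final int votingViewSize = 3;   // three voters in the ensemble

            Map<Long, Vote> recvset = new HashMap<Long, Vote>();
            recvset.put(37L,  new Vote(37, myRound)); // node 3's own vote
            recvset.put(215L, new Vote(215, 1));      // stale notification from node 1
            recvset.put(126L, new Vote(126, 1));      // stale notification from node 2

            // Shape of the quoted check: terminate as soon as something has been
            // heard from every voter, regardless of round or content.
            boolean sizeOnlyCheck = (recvset.size() == votingViewSize);

            // A round-aware count of actual support for the local proposal.
            int support = 0;
            for (Vote v : recvset.values()) {
                if (v.electionRound == myRound && v.leader == proposedLeader) {
                    support++;
                }
            }

            System.out.println("size-only check would terminate: " + sizeOnlyCheck); // true
            System.out.println("current-round votes for node 37: " + support);       // 1 of 3
        }
    }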
3. Now, 3 is in the LEADING state, and it will remain there for up to tickTime * initLimit, or until a majority of followers start following 3. In the meantime, 2 boots and starts leader election. 2 receives a notification from 3, which claims that 3 is the leader:

2533 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 37 (n.leader), 17179869831 (n.zxid), 3 (n.round), LEADING (n.state), 37 (n.sid), LOOKING (my state)
2534 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - id: 126, proposed id: 126, zxid: 17179869911, proposed zxid: 17179869911
2534 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote: From = 126, Proposed leader = 126, Porposed zxid = 17179869911, Proposed epoch = 1
2534 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - id: 37, proposed id: 126, zxid: 17179869831, proposed zxid: 17179869911
2534 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote: From = 37, Proposed leader = 37, Porposed zxid = 17179869831, Proposed epoch = 3
2534 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - I'm a participant: 126

2 blindly believes that 3 is the leader and starts following 3! As a result, 2's state is rolled back (because 3 has an older zxid), so 2 truncates its transaction log. Here's the incorrect if statement:

    default:
        /*
         * There is at most one leader for each epoch, so if a
         * peer claims to be the leader for an epoch, then that
         * peer must be the leader (no arbitrary failures
         * assumed). Now, if there is no quorum supporting
         * this leader, then processes will naturally move
         * to a new epoch.
         */
        if (n.epoch == logicalclock) {
            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.epoch));
            if ((n.state == ServerState.LEADING) ||
                    (termPredicate(recvset, new Vote(n.leader, n.zxid, n.epoch, n.state))
                            && checkLeader(outofelection, n.leader, n.epoch))) {
                self.setPeerState((n.leader == self.getId()) ?
                        ServerState.LEADING : learningState());
                leaveInstance();
                return new Vote(n.leader, n.zxid);
            }
        }

4. Now, 1 is also running leader election (it does this immediately after losing its majority). 1 receives votes from 2 and 3 saying that 3 is the leader, so 1 starts following 3:

18427 314274 [WorkerReceiver Thread] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new notification message. My id = 215
18428 314274 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 37 (n.leader), 17179869831 (n.zxid), 3 (n.round), FOLLOWING (n.state), 126 (n.sid), LOOKING (my state)
18429 314274 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - I'm a participant: 215
18430 314274 [QuorumPeer:/0.0.0.0:2888] INFO org.apache.zookeeper.server.quorum.QuorumPeer - FOLLOWING

As a result, 1 ends up truncating its transactions as well. In order for your client to see the 3292 znode, the ensemble would have to have rolled back at least (3348 - 3291) = 57 transactions. We can see that both 1 and 2 went back (17179869911 - 17179869831) = 80 transactions.

In short, there are several places where a peer goes into incorrect states. The main reason is that it does not always rely on <epoch, id, zxid> to determine leadership. ZOOKEEPER-975 had identified some of them, and the fix for ZOOKEEPER-975 would have prevented this bug. I already have a working patch for ZOOKEEPER-975 that I did not attach to that bug, because I had implemented it on top of ZOOKEEPER-932. To avoid back-porting and testing effort, I was hoping that ZOOKEEPER-932 would get reviewed soon so that I could post a patch for ZOOKEEPER-975.
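The rollback figures above can be sanity-checked directly. A zxid packs the leader epoch in its high 32 bits and a per-epoch counter in its low 32 bits, so the two zxids from the logs decompose as follows (a standalone sketch; nothing beyond that bit layout is assumed):

    public class ZxidRollback {
        public static void main(String[] args) {
            long before = 17179869911L; // last zxid proposed by nodes 1 and 2 (from the logs)
            long after  = 17179869831L; // the stale zxid node 3 led with

            // High 32 bits = zxid epoch, low 32 bits = counter within that epoch.
            System.out.printf("before: epoch=%d counter=%d%n",
                    before >> 32, before & 0xffffffffL);  // epoch=4 counter=727
            System.out.printf("after:  epoch=%d counter=%d%n",
                    after >> 32, after & 0xffffffffL);    // epoch=4 counter=647

            System.out.println("transactions rolled back: "
                    + ((before & 0xffffffffL) - (after & 0xffffffffL))); // 80
            System.out.println("sequential records rolled back (at least): "
                    + (3348 - 3291));                                    // 57
        }
    }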
It turns out that ZOOKEEPER-975 is a blocker, so I will pull out the ZOOKEEPER-975 changes and submit them for review. We might have to make more changes to clean up FastLeaderElection, but my patch for ZOOKEEPER-975 should at least get rid of the bug you are seeing. Let me know if you think I have miscalculated anything here.
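For reference, here is a minimal sketch of the kind of sequential create the /zkrsm clients in the quoted report below are doing (the connect string, session timeout, and empty payload are assumptions for illustration, and the /zkrsm parent is assumed to exist). The 10-digit suffix is assigned by the ensemble from a counter kept as part of the parent znode's server-side state, which is why rolling the servers' state back can make a later create return a smaller number than an earlier one, exactly the 3348 -> 3292 symptom reported below.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SequentialCreateSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string and timeout; the report does not give them.
            ZooKeeper zk = new ZooKeeper("13.0.0.11:2181", 6000, event -> { });

            // The server appends a zero-padded counter to the requested name,
            // e.g. /zkrsm/0000000000000000_record0000003348.
            String path = zk.create("/zkrsm/0000000000000000_record",
                    new byte[0],                        // payload omitted in this sketch
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT_SEQUENTIAL);

            System.out.println("created " + path);
            zk.close();
        }
    }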
> Sequence number assignment decreases after old node rejoins cluster
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1026
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1026
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Jeremy Stribling
>         Attachments: ZOOKEEPER-1026.logs.tgz
>
>
> I ran into a weird case where a Zookeeper server rejoins the cluster after
> missing several operations, and then a client creates a new sequential node
> that has a number earlier than the last node it created. I don't have full
> logs, or a live system in this state, or any data directories, just some
> partial server logs and the evidence as seen by the client. Haven't tried
> reproducing it yet, just wanted to see if anyone here had any ideas. Here's
> the scenario (probably more info than necessary, but trying to be complete):
> 1) Initially (5:37:20): 3 nodes up, with ids 215, 126, and 37 (called nodes
> #1, #2, and #3 below)
> 2) Nodes periodically (and throughout this whole timeline) create sequential,
> non-ephemeral nodes under the /zkrsm parent node.
> 3) 5:46:57: Node #1 gets notified of /zkrsm/0000000000000000_record0000002116
> 4) 5:47:06: Node #1 restarts and rejoins
> 5) 5:49:26: Node #2 gets notified of /zkrsm/0000000000000000_record0000002708
> 6) 5:49:29: Node #2 restarts and rejoins
> 7) 5:52:01: Node #3 gets notified of /zkrsm/0000000000000000_record0000003291
> 8) 5:52:02: Node #3 restarts and begins the rejoining process
> 9) 5:52:08: Node #1 successfully creates /zkrsm/0000000000000000_record0000003348
> 10) 5:52:08: Node #2 dies after getting notified of /zkrsm/0000000000000000_record0000003348
> 11) 5:52:10ish: Node #3 is elected leader (the ZK server log doesn't have
> wallclock timestamps, so not exactly sure on the ordering of this step)
> 12) 5:52:15: Node #1 successfully creates /zkrsm/0000000000000000_record0000003292
> Note that the node created in step #12 is lower than the one created in step
> #9, and is exactly one greater than the last node seen by node #3 before it
> restarted.
> Here is the sequence of session establishments as seen from the C client of
> node #1 after its restart (the IP address of node #1=13.0.0.11, #2=13.0.0.12,
> #3=13.0.0.13):
> 2011-03-18 05:46:59,838:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.13:2888],
> sessionId=0x252ec780a3020000, negotiated timeout=6000
> 2011-03-18 05:49:32,194:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.13:2888],
> sessionId=0x252ec782f5100002, negotiated timeout=6000
> 2011-03-18 05:52:02,352:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.12:2888],
> sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:08,583:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.11:2888],
> sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:13,834:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.11:2888],
> sessionId=0xd72ec7856d0f0001, negotiated timeout=6000
> I will attach logs for all nodes after each of their restarts, and a partial
> log for node #3 from before its restart.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira