[ https://issues.apache.org/jira/browse/ZOOKEEPER-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009845#comment-13009845 ]
Camille Fournier commented on ZOOKEEPER-1026:
---------------------------------------------

These logs are very difficult to read, but one thing I notice is this sequence from node1_after_restart:

314258 [Thread-3] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - Address of remote peer: 126
314267 [WorkerReceiver Thread] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new notification message. My id = 215
314267 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 126 (n.leader), 17179869911 (n.zxid), 1 (n.round), LOOKING (n.state), 126 (n.sid), LOOKING (my state)
314267 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Notification epoch is smaller than logicalclock. n.epoch = 1, Logical clock4
314267 [WorkerSender Thread] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a connection already for server 126
314273 [WorkerReceiver Thread] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new notification message. My id = 215
314273 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 126 (n.leader), 17179869911 (n.zxid), 3 (n.round), LOOKING (n.state), 126 (n.sid), LOOKING (my state)
314273 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Notification epoch is smaller than logicalclock. n.epoch = 3, Logical clock4
314273 [WorkerSender Thread] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a connection already for server 126
314274 [WorkerReceiver Thread] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new notification message. My id = 215
314274 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 37 (n.leader), 17179869831 (n.zxid), 3 (n.round), FOLLOWING (n.state), 126 (n.sid), LOOKING (my state)
314274 [QuorumPeer:/0.0.0.0:2888] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - I'm a participant: 215
[QuorumPeer:/0.0.0.0:2888] INFO org.apache.zookeeper.server.quorum.QuorumPeer - FOLLOWING

Specifically, note that n.zxid starts at 17179869911 and has dropped to 17179869831 by the time the election finishes. And in fact node3 does believe it is LEADING, according to its logs. It's also confusing that the node3 logs have a bunch of lines like this:

6397 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 215 (n.leader), 17179869918 (n.zxid), 4 (n.round), LOOKING (n.state), 215 (n.sid), LEADING (my state)

Why is my state LEADING while n.leader is node1 (215), and node1 itself is LOOKING?
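For anyone trying to decode those zxids: the high 32 bits of a zxid are the leader epoch and the low 32 bits are a counter within that epoch. A quick sketch of the arithmetic (plain Java; the values are lifted from the logs above, not from any ZooKeeper API):

    public class ZxidDecode {
        public static void main(String[] args) {
            // zxid layout: high 32 bits = leader epoch, low 32 bits = per-epoch txn counter
            long[] zxids = {17179869911L, 17179869831L};
            for (long zxid : zxids) {
                long epoch = zxid >>> 32;           // 4 for both values
                long counter = zxid & 0xffffffffL;  // 727, then 647
                System.out.printf("zxid %d -> epoch %d, counter %d%n", zxid, epoch, counter);
            }
        }
    }

Both values decode to epoch 4, so the final vote lands on a log position 80 transactions behind the one first advertised, which lines up with node3 having rejoined with stale data and still winning the election.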
> Sequence number assignment decreases after old node rejoins cluster
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1026
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1026
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Jeremy Stribling
>         Attachments: ZOOKEEPER-1026.logs.tgz
>
>
> I ran into a weird case where a ZooKeeper server rejoins the cluster after missing several operations, and then a client creates a new sequential node that has a number earlier than the last node it created. I don't have full logs, or a live system in this state, or any data directories, just some partial server logs and the evidence as seen by the client. I haven't tried reproducing it yet; I just wanted to see if anyone here had any ideas.
>
> Here's the scenario (probably more info than necessary, but trying to be complete):
>
> 1) Initially (5:37:20): 3 nodes up, with ids 215, 126, and 37 (called nodes #1, #2, and #3 below).
> 2) Nodes periodically (and throughout this whole timeline) create sequential, non-ephemeral nodes under the /zkrsm parent node.
> 3) 5:46:57: Node #1 gets notified of /zkrsm/0000000000000000_record0000002116
> 4) 5:47:06: Node #1 restarts and rejoins
> 5) 5:49:26: Node #2 gets notified of /zkrsm/0000000000000000_record0000002708
> 6) 5:49:29: Node #2 restarts and rejoins
> 7) 5:52:01: Node #3 gets notified of /zkrsm/0000000000000000_record0000003291
> 8) 5:52:02: Node #3 restarts and begins the rejoining process
> 9) 5:52:08: Node #1 successfully creates /zkrsm/0000000000000000_record0000003348
> 10) 5:52:08: Node #2 dies after getting notified of /zkrsm/0000000000000000_record0000003348
> 11) 5:52:10ish: Node #3 is elected leader (the ZK server log doesn't have wallclock timestamps, so not exactly sure on the ordering of this step)
> 12) 5:52:15: Node #1 successfully creates /zkrsm/0000000000000000_record0000003292
>
> Note that the node created in step #12 is lower than the one created in step #9, and is exactly one greater than the last node seen by node #3 before it restarted.
>
> Here is the sequence of session establishments as seen from the C client of node #1 after its restart (the IP address of node #1=13.0.0.11, #2=13.0.0.12, #3=13.0.0.13):
>
> 2011-03-18 05:46:59,838:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632: session establishment complete on server [13.0.0.13:2888], sessionId=0x252ec780a3020000, negotiated timeout=6000
> 2011-03-18 05:49:32,194:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632: session establishment complete on server [13.0.0.13:2888], sessionId=0x252ec782f5100002, negotiated timeout=6000
> 2011-03-18 05:52:02,352:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632: session establishment complete on server [13.0.0.12:2888], sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:08,583:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632: session establishment complete on server [13.0.0.11:2888], sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:13,834:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632: session establishment complete on server [13.0.0.11:2888], sessionId=0xd72ec7856d0f0001, negotiated timeout=6000
>
> I will attach logs for all nodes after each of their restarts, and a partial log for node #3 from before its restart.
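For context on step 2 of the quoted timeline: with PERSISTENT_SEQUENTIAL creates, the client supplies only a name prefix and the server assigns the 10-digit suffix from a counter kept on the parent znode, which is exactly the kind of state that can regress if a leader with a stale copy of /zkrsm takes over. A minimal sketch of the kind of create the clients are doing (this uses the Java API, though the reporter's client is C; the connect string and payload are assumptions):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SeqCreateSketch {
        public static void main(String[] args) throws Exception {
            // Connect string is an assumption for illustration; any reachable
            // ensemble member would do.
            ZooKeeper zk = new ZooKeeper("13.0.0.11:2181", 6000, event -> { });

            // The client supplies only the prefix; the server appends a 10-digit
            // sequence number taken from a counter kept on the parent znode.
            String path = zk.create("/zkrsm/0000000000000000_record",
                    new byte[0],                   // payload is an assumption
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT_SEQUENTIAL);

            // The returned path is where the client learns its assigned number,
            // e.g. /zkrsm/0000000000000000_record0000003292
            System.out.println("created " + path);
            zk.close();
        }
    }

Under a leader whose copy of /zkrsm is behind, that parent counter is behind too, which would produce exactly the step #9 vs. step #12 regression described above: ...record0000003292 handed out after ...record0000003348.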