Hi all,
I ran into a weird case where Zookeeper seems to have elected a new
leader (a node that just restarted and missed several operations), and a
client is able to connect to it and create a new sequential node that
has a number earlier than the last node it created. I don't have full
logs, or a live system in this state, or any data directories, just some
partial server logs and the evidence as seen by the client. Haven't
tried reproducing it yet, just wanted to see if anyone here had any
ideas. Here's the scenario (probably more info than necessary, but
trying to be complete)
Version: Zookeeper 3.3.3
1) Initially (5:37:20): 3 nodes up, with ids 215, 126, and 37 (called
nodes #1, #2, and #3 below):
2) Nodes periodically (and throughout this whole timeline) create
sequential, non-ephemeral nodes under the /zkrsm parent node.
3) 5:46:57: Node #1 gets notified of
/zkrsm/0000000000000000_record0000002116
4) 5:47:06: Node #1 restarts and rejoins
5) 5:49:26: Node #2 gets notified of
/zkrsm/0000000000000000_record0000002708
6) 5:49:29: Node #2 restarts and rejoins
7) 5:52:01: Node #3 gets notified of
/zkrsm/0000000000000000_record0000003291
8) 5:52:02: Node #3 restarts and begins the rejoining process
9) 5:52:08: Node #1 successfully creates
/zkrsm/0000000000000000_record0000003348
10) 5:52:08: Node #2 dies after getting notified of
/zkrsm/0000000000000000_record0000003348
11) 5:52:10ish: Node #3 is elected leader (the ZK server log doesn't
have wallclock timestamps, so not exactly sure on the ordering of this step)
12) 5:52:15: Node #1 successfully creates
/zkrsm/0000000000000000_record0000003292
Note that the node created in step #12 is lower than the one created in
step #9, and is exactly one greater than the last node seen by node #3
before it restarted.
The leader election bit from node #3's log after restarting might be of
interest:
2644 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.QuorumPeer - LOOKING
2644 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.QuorumPeer - Initializing leader
election protocol...
2647 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - New election.
My id = 37, Proposed zxid = 17179869831
2650 [WorkerSender Thread] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Opening channel
to server 126
2653 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2653 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
37 (n.leader), 17179869831 (n.zxid), 1 (n.round), LOOKING (n.state),
37 (n.sid), LOOKING (my state)
2654 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - id: 37,
proposed id: 37, zxid: 17179869831, proposed zxid: 17179869831
2655 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote:
From = 37, Proposed leader = 37, Porposed zxid = 17179869831, Proposed
epoch = 1
2657 [WorkerSender Thread] INFO
org.apache.zookeeper.server.quorum.QuorumCnxManager - Have smaller
server identifier, so dropping the connection: (126, 37)
2657 [WorkerSender Thread] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Opening channel
to server 215
2661 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Connection
request /13.0.0.12:60623
2661 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Connection
request: 37
2661 [WorkerSender Thread] INFO
org.apache.zookeeper.server.quorum.QuorumCnxManager - Have smaller
server identifier, so dropping the connection: (215, 37)
2662 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Address of
remote peer: 126
2668 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Connection
request /13.0.0.11:60190
2668 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Connection
request: 37
2668 [Thread-3] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - Address of
remote peer: 215
2669 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2670 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
215 (n.leader), 12884902548 (n.zxid), 3 (n.round), LOOKING (n.state),
126 (n.sid), LOOKING (my state)
2670 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - id: 215,
proposed id: 37, zxid: 12884902548, proposed zxid: 17179869831
2670 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2670 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote:
From = 126, Proposed leader = 215, Porposed zxid = 12884902548,
Proposed epoch = 3
2670 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
215 (n.leader), 12884902548 (n.zxid), 3 (n.round), FOLLOWING
(n.state), 126 (n.sid), LOOKING (my state)
2671 [WorkerSender Thread] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a
connection already for server 126
2671 [WorkerSender Thread] DEBUG
org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a
connection already for server 215
2671 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2671 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
37 (n.leader), 17179869831 (n.zxid), 3 (n.round), LOOKING (n.state),
37 (n.sid), LOOKING (my state)
2671 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2671 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
215 (n.leader), 12884902548 (n.zxid), 3 (n.round), LOOKING (n.state),
215 (n.sid), LOOKING (my state)
2671 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - id: 37,
proposed id: 37, zxid: 17179869831, proposed zxid: 17179869831
2672 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2672 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote:
From = 37, Proposed leader = 37, Porposed zxid = 17179869831, Proposed
epoch = 3
2672 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
215 (n.leader), 12884902548 (n.zxid), 3 (n.round), LEADING (n.state),
215 (n.sid), LOOKING (my state)
2672 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - id: 215,
proposed id: 37, zxid: 12884902548, proposed zxid: 17179869831
2672 [QuorumPeer:/0.0.0.0:2888] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Adding vote:
From = 215, Proposed leader = 215, Porposed zxid = 12884902548,
Proposed epoch = 3
2672 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
2672 [WorkerReceiver Thread] INFO
org.apache.zookeeper.server.quorum.FastLeaderElection - Notification:
215 (n.leader), 12884902548 (n.zxid), 3 (n.round), FOLLOWING
(n.state), 126 (n.sid), LEADING (my state)
2673 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.QuorumPeer - LEADING
2675 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.Leader - TCP NoDelay set to: true
2693 [WorkerReceiver Thread] DEBUG
org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new
notification message. My id = 37
I don't know much about the internal workings of Zookeeper, but what I
find weird here is that node #3 proposes zxid 17179869831 (0x400000287),
which seems to indicate that it wants to start well into epoch 4 before
it even opens channels to other nodes, finding out that their zxids are
only 12884902548 (0x300000294). Is it possible that node #3 thought its
zxid was not less than the other nodes' zxids, and so did not bother
transferring the operations it missed?
Any other ideas on what could have gone wrong? I dont see any other
obvious exceptions in node #3's log, except:
2487 [pool-1-thread-2] DEBUG
org.apache.zookeeper.server.persistence.FileTxnLog - EOF excepton
java.io.EOFException: Failed to read
but that seems to happen all the time, and it's at DEBUG level, so I'm
guessing it's expected.
Thanks,
Jeremy