[ https://issues.apache.org/jira/browse/ZOOKEEPER-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mutu updated ZOOKEEPER-4816:
----------------------------
    Attachment:     (was: node2.log)

> A follower can not join the cluster for 20s seconds
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-4816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Critical
>
> We encountered a strange scenario. When we set up a ZooKeeper cluster (3
> nodes in total), the third node gets stuck in {*}sealStream{*} while
> serializing the snapshot to the local disk. Leader election nevertheless
> runs normally, and the third node is elected leader. The other two nodes
> fail to connect to that leader, so the first and second nodes restart
> leader election, and the second node is finally elected leader. At this
> point the third node still acts as leader, so there are two leaders in the
> cluster. The first node cannot join the cluster for 20 seconds.
> The logs of the first node are as follows.
> {code:java}
> 2024-03-12 07:20:51,552 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,565 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,594 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,608 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,608 [myid:] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@1205] - Oracle indicates not to follow
> {code}
> During this procedure, clients cannot connect to any node in the cluster.
> Runtime logs are attached.
> Is the root cause that serializing the snapshot blocks the state change of
> the third node?
> Any comments that help figure out this issue would be much appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
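
For anyone triaging the attached logs, the Notification lines quoted in the report can be scanned mechanically. The sketch below assumes the field layout visible in those lines (`n.sid`, `n.state`, `n.leader`, `n.round`). The regex, the names `parse_notifications` and `lagging_leaders`, and the "LEADING in an older round" check are illustrative assumptions for reading logs; they are not ZooKeeper code and do not reproduce its election logic.

```python
import re

# Matches the fields as they appear in the Notification lines of the report.
# The pattern itself is an assumption made for this sketch.
NOTIFICATION = re.compile(
    r"n\.sid:(?P<sid>\d+),\s*n\.state:(?P<state>\w+),\s*"
    r"n\.leader:(?P<leader>\d+),\s*n\.round:(?P<round>0x[0-9a-fA-F]+)"
)


def parse_notifications(log_text):
    """Return one dict per Notification found in unquoted log text."""
    found = []
    for m in NOTIFICATION.finditer(log_text):
        found.append({
            "sid": int(m.group("sid")),
            "state": m.group("state"),
            "leader": int(m.group("leader")),
            "round": int(m.group("round"), 16),  # "0x2" -> 2
        })
    return found


def lagging_leaders(notifications):
    """Heuristic: senders that claim LEADING in a round older than the
    newest round seen among LOOKING senders."""
    looking = [n["round"] for n in notifications if n["state"] == "LOOKING"]
    if not looking:
        return []
    newest = max(looking)
    return [n for n in notifications
            if n["state"] == "LEADING" and n["round"] < newest]


# Notification fragments copied from the report, with the timestamp and
# thread prefix trimmed for brevity.
SAMPLE = """
Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0
"""

if __name__ == "__main__":
    for n in lagging_leaders(parse_notifications(SAMPLE)):
        print(f"sid {n['sid']} reports LEADING in round {n['round']:#x}")
```

On the quoted lines this flags sid 3, which reports LEADING in round 0x1 while nodes 1 and 2 are still LOOKING in round 0x2.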