[jira] [Updated] (ZOOKEEPER-4816) A follower can not join the cluster for 20s seconds

mutu (Jira) Tue, 02 Jul 2024 19:34:12 -0700


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


mutu updated ZOOKEEPER-4816:
----------------------------
    Attachment:     (was: node2.log)

> A follower can not join the cluster for 20s seconds
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-4816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Critical
>
> We encountered a strange scenario. When we set up a ZooKeeper cluster (3 
> nodes in total), the third node got stuck in ({*}sealStream{*}) while 
> serializing the snapshot to the local disk. Leader election nevertheless 
> proceeded normally, and the third node was elected leader. The other two 
> nodes failed to connect to that leader, so the first and second nodes 
> restarted leader election, and the second node was eventually elected 
> leader. At that point the third node still acted as leader, so there were 
> two leaders in the cluster. The first node could not join the cluster for 20s. 
> The logs of the first node are as follows.
> {code:java}
> 2024-03-12 07:20:51,552 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,565 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,594 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,608 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2024-03-12 07:20:51,608 [myid:] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@1205] - Oracle indicates not to follow
> {code}
> During this period, clients could not connect to any node of the 
> cluster.
> Runtime logs are attached.
> Is the root cause that serializing the snapshot blocks the status change 
> of the third node?
> Does anyone have suggestions for diagnosing this issue? I would greatly 
> appreciate them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
