Hi,

I was wondering if anybody has seen ZooKeeper 3.8.4 in this state before. One of our instances (a Kubernetes pod) was stuck in it for days. With DEBUG logging enabled, I see the following messages:
[INFO] 2025-02-18 10:11:43,985 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection lookForLeader - Notification time out: 60000 ms
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.QuorumCnxManager haveDelivered - Queue size: 0
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection sendNotifications - Sending Notification: 11 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 0 (recipient), 0 (myid), 0x0 (n.peerEpoch)
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection sendNotifications - Sending Notification: 11 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 1 (recipient), 0 (myid), 0x0 (n.peerEpoch)
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection sendNotifications - Sending Notification: 11 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 2 (recipient), 0 (myid), 0x0 (n.peerEpoch)
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection sendNotifications - Sending Notification: 11 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 10 (recipient), 0 (myid), 0x0 (n.peerEpoch)
[DEBUG] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection sendNotifications - Sending Notification: 11 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 11 (recipient), 0 (myid), 0x0 (n.peerEpoch)
[INFO] 2025-02-18 10:12:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.FastLeaderElection lookForLeader - Notification time out: 60000 ms
[DEBUG] 2025-02-18 10:13:43,986 [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] org.apache.zookeeper.server.quorum.QuorumCnxManager haveDelivered - Queue size: 0

The instance seems to be stuck in the loop at line 958 of this function, although my Java skills are pretty rusty, so I might be wrong:
https://github.com/apache/zookeeper/blob/release-3.8.4-0/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L912

The only thing that got this instance out of that loop, and made it reply to requests on port 2181 again, was restarting the leader. That triggered a new election, and the pod left the loop immediately (I guess because the return value of getPeerState() changed after the election). Is this a bug, or is it expected behavior?

What is also weird is that, in the Prometheus metrics, the pod was reporting itself as a leader as well, but when we used, for example, stat on port 2181, we got the message that ZooKeeper was not serving requests. I guess this might also be a bug.

Do I need to tweak any settings to prevent this from happening again? In particular, I am wondering whether tweaking the backoff values for FastLeaderElection would have any impact on this issue.

Thanks,
Victor
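To make the backoff question concrete, here is a minimal sketch of how the notification timeout appears to grow, as far as I can tell from reading FastLeaderElection: the wait doubles after each round with no notifications until it reaches a cap. The function name is hypothetical, and the 200 ms and 60000 ms defaults are assumptions taken from my reading of the source (I believe they correspond to the zookeeper.fastleader.minNotificationInterval and zookeeper.fastleader.maxNotificationInterval system properties). Only the 60000 ms value is confirmed by the log lines above.

```python
def notification_timeouts(min_interval_ms=200, max_interval_ms=60000, rounds=12):
    """Return the successive wait times of a doubling backoff with a cap.

    The default values are assumptions, not confirmed settings of the
    affected cluster.
    """
    timeout = min_interval_ms
    waits = []
    for _ in range(rounds):
        waits.append(timeout)
        # Double the wait after an empty round, but never exceed the cap.
        timeout = min(timeout * 2, max_interval_ms)
    return waits


if __name__ == "__main__":
    print(notification_timeouts())
```

If that reading is right, a peer that logs "Notification time out: 60000 ms" is already at the cap, so it resends its vote once per minute. Lowering the cap would make it retry more often, but it would not by itself explain why no notifications arrive.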
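For anyone who wants to compare a pod's metrics with what it actually answers on the client port, the following is a rough sketch of a stat probe like the one mentioned above. The helper names are hypothetical, and the exact wording of the "not serving" reply is an assumption based on the message we saw. The command also has to be allowed by the server's four-letter-word whitelist.

```python
import socket

# Assumed fragment of the reply a non-serving peer sends; adjust if the
# wording differs in your version.
NOT_SERVING = "not currently serving requests"


def four_letter_word(host, port=2181, cmd="stat", timeout=5.0):
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def classify(reply):
    """Return 'not-serving', the reported mode, or 'unknown'."""
    if NOT_SERVING in reply:
        return "not-serving"
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"
```

Calling classify(four_letter_word("some-pod-hostname")) against each pod and comparing the result with the role shown in Prometheus would make the mismatch described above easy to spot; the hostname here is a placeholder.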
