[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050796#comment-18050796
 ] 

lujingyu commented on ZOOKEEPER-5010:
-------------------------------------

If possible, could someone please help me confirm whether this is a new bug 
different from previous ones? I would really appreciate it!

> Leader DIFF synchronization does not remove stale ephemeral znodes on 
> followers, causing permanent state divergence.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-5010
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5010
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.8.5
>            Reporter: lujingyu
>            Priority: Major
>         Attachments: zookeeper--server-NC1ZK1.txt, 
> zookeeper--server-NC1ZK2.txt, zookeeper--server-NC1ZK3.txt, 
> zookeeper--server-NC1ZK4.txt, zookeeper--server-NC1ZK5.txt
>
>
> We observed a persistent state divergence in a ZooKeeper cluster where an 
> ephemeral znode remains present on some followers but is absent on the 
> leader. This divergence persists across leader changes and does not self-heal.
> The issue occurs when the original leader suddenly crashes and the cluster 
> goes through a new leader election. The new leader synchronizes with all 
> followers using DIFF, but some followers' in-memory states contain 
> ephemeral nodes that are semantically invalid and have never been part of the 
> leader's final committed history.
> --------------------------------------------------------------------------------------------------------------------------------
> The actual test scenario is as follows:
> The cluster has five nodes: NC1ZK1(172.20.0.2), NC1ZK2(172.20.0.3), 
> NC1ZK3(172.20.0.4), NC1ZK4(172.20.0.5), NC1ZK5(172.20.0.6)
>  # Start a ZooKeeper cluster with multiple servers.
>  # A client (ZK1Cli) connects to the cluster and performs the following 
> operations:
>  ** create a persistent znode
>  ** update the znode
>  ** delete the znode
>  ** request creation of an ephemeral znode /eph
>  # After the ephemeral znode creation is processed, the current 
> leader (NC1ZK3) crashes while deserializing quorum packets:
> {quote}org.apache.zookeeper.server.quorum.QuorumPacket.deserialize$$PHOSPHORTAGGED(QuorumPacket.java:85),
>  
> org.apache.jute.BinaryInputArchive.readRecord$$PHOSPHORTAGGED(BinaryInputArchive.java:136),
>  
> org.apache.zookeeper.server.quorum.LearnerHandler.run$$PHOSPHORTAGGED(LearnerHandler.java:656),
>  org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java)
> {quote}
>  # The client session terminates shortly after.
>  # Multiple leader elections occur in the cluster before a new 
> client (ZK1Cli2) attempts to read the ephemeral znode /eph, which should not 
> exist.
>  # In one special case, after a leader election and before the newly 
> elected leader (NC1ZK5) crashes, the previously crashed node (NC1ZK3) 
> rejoins the cluster and becomes a follower of the new leader (NC1ZK5).
>  # The leader (NC1ZK5) then crashes.
>  # After the cluster re-election, NC1ZK4 is elected as the new leader.
>  # NC1ZK4 (leader) starts synchronizing with the followers in the cluster, 
> but it uses DIFF synchronization.
> {quote}From NC1ZK4's log
> 2025-12-28 22:46:47,815 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.2:46746:LearnerHandler@805] - Synchronizing with 
> Learner sid: 1 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,815 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.2:46746:LearnerHandler@850] - Sending DIFF 
> zxid=0x300000001 for peer sid: 1
> 2025-12-28 22:46:47,816 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.4:41824:LearnerHandler@805] - Synchronizing with 
> Learner sid: 3 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,816 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.4:41824:LearnerHandler@850] - Sending DIFF 
> zxid=0x300000001 for peer sid: 3
> 2025-12-28 22:46:47,896 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.3:59386:LearnerHandler@805] - Synchronizing with 
> Learner sid: 2 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,896 [myid:4] - INFO 
> [LearnerHandler-/172.20.0.3:59386:LearnerHandler@850] - Sending DIFF 
> zxid=0x300000001 for peer sid: 2
> {quote}
>  # After synchronization:
>  ** Some followers still contain the ephemeral znode /eph
>  ** The leader and other servers do not contain /eph
> At this point, the cluster reaches a stable state with permanent data 
> divergence.
> --------------------------------------------------------------------------------------------------------------------------------
> After observing four nodes (NC1ZK5, which subsequently failed, was excluded 
> from the observation), it was found that the contents of the txn logs of 
> NC1ZK1, NC1ZK2 and NC1ZK4 are identical, but NC1ZK1 and NC1ZK4 have different 
> node lists.
> {quote}[error]FAV test has failed: "2025-12-28 22:47:03,744 [ZKChecker] - 
> INFO - server NC1ZK1:11181 and server NC1ZK4:11181 have different number of 
> znodes:[/zookeeper/quota, /bug, /eph, /zookeeper] [/zookeeper/quota, /bug, 
> /zookeeper]"
> {quote}
> ----
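
To make the log lines and the checker output above easier to read, here is a minimal Java sketch. It is not ZooKeeper's code: the class DiffSyncSketch and its methods are hypothetical names, and chooseSyncMode is a deliberately simplified model of the leader's decision that ignores on-disk txn log replay and epoch edge cases. It relies only on two things: a zxid packs the epoch in the high 32 bits and a counter in the low 32 bits, and the values logged by NC1ZK4. It shows why peerLastZxid == lastProcessedZxid leads to a DIFF that carries no transactions, so a znode that exists only in a follower's in-memory tree is not removed by it.

```java
import java.util.Set;
import java.util.TreeSet;

/** Simplified illustration only; all names here are hypothetical. */
final class DiffSyncSketch {

    enum SyncMode { DIFF, TRUNC, SNAP }

    /** High 32 bits of a zxid: the leader epoch. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits of a zxid: the per-epoch transaction counter. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Simplified model of how a leader picks a sync mode for a learner. */
    static SyncMode chooseSyncMode(long peerLastZxid, long lastProcessedZxid,
                                   long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid == lastProcessedZxid) {
            // Zxids match: the leader assumes the learner is already in sync
            // and sends a DIFF that contains no transactions.
            return SyncMode.DIFF;
        }
        if (peerLastZxid > maxCommittedLog) {
            return SyncMode.TRUNC; // learner is ahead of the leader's log
        }
        if (peerLastZxid >= minCommittedLog) {
            return SyncMode.DIFF;  // replay the missing committed proposals
        }
        return SyncMode.SNAP;      // too far behind: send a full snapshot
    }

    /** Paths present on the follower but absent on the leader. */
    static Set<String> staleOnFollower(Set<String> follower, Set<String> leader) {
        Set<String> stale = new TreeSet<>(follower);
        stale.removeAll(leader);
        return stale;
    }

    public static void main(String[] args) {
        long zxid = 0x300000001L; // value taken from NC1ZK4's log
        System.out.println("epoch=" + epochOf(zxid) + " counter=" + counterOf(zxid));

        SyncMode mode = chooseSyncMode(zxid, zxid, 0x100000001L, 0x300000001L);
        System.out.println("mode=" + mode);

        // Znode lists reported by the checker for NC1ZK1 and NC1ZK4.
        Set<String> nc1zk1 = Set.of("/zookeeper/quota", "/bug", "/eph", "/zookeeper");
        Set<String> nc1zk4 = Set.of("/zookeeper/quota", "/bug", "/zookeeper");
        System.out.println("stale=" + staleOnFollower(nc1zk1, nc1zk4));
    }
}
```

Running main prints epoch=3 counter=1, mode=DIFF and stale=[/eph]. In this simplified model the sync decision looks only at zxids, which is consistent with the observation that identical txn logs did not prevent different node lists.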



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
