lujingyu created ZOOKEEPER-5010:
-----------------------------------

             Summary: Leader DIFF synchronization does not remove stale 
ephemeral znodes on followers, causing permanent state divergence.
                 Key: ZOOKEEPER-5010
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5010
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.8.5
            Reporter: lujingyu
         Attachments: zookeeper--server-NC1ZK1.txt, 
zookeeper--server-NC1ZK2.txt, zookeeper--server-NC1ZK3.txt, 
zookeeper--server-NC1ZK4.txt, zookeeper--server-NC1ZK5.txt

We observed a persistent state divergence in a ZooKeeper cluster where an 
ephemeral znode remains present on some followers but is absent on the leader. 
This divergence persists across leader changes and does not self-heal.

The issue occurs when the current leader crashes suddenly and the cluster 
elects a new leader. The new leader synchronizes with all followers using DIFF, 
but the in-memory state of some followers contains ephemeral znodes that are 
semantically invalid and were never part of the leader's final committed 
history.

--------------------------------------------------------------------------------------------------------------------------------
The actual test scenario is as follows:

The cluster has five nodes: NC1ZK1(172.20.0.2), NC1ZK2(172.20.0.3), 
NC1ZK3(172.20.0.4), NC1ZK4(172.20.0.5), NC1ZK5(172.20.0.6)
 # Start a ZooKeeper cluster with multiple servers.

 # A client (ZK1Cli) connects to the cluster and performs the following 
operations:

 ** create a persistent znode

 ** update the znode

 ** delete the znode

 ** request creation of an ephemeral znode /eph

 # After the ephemeral znode creation has been processed, the current 
leader (NC1ZK3) crashes while deserializing a quorum packet:
{quote}org.apache.zookeeper.server.quorum.QuorumPacket.deserialize$$PHOSPHORTAGGED(QuorumPacket.java:85),
 
org.apache.jute.BinaryInputArchive.readRecord$$PHOSPHORTAGGED(BinaryInputArchive.java:136),
 
org.apache.zookeeper.server.quorum.LearnerHandler.run$$PHOSPHORTAGGED(LearnerHandler.java:656),
 org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java)
{quote}
 # The client session terminates shortly after.

 # Multiple leader elections occur in the cluster before a new 
client (ZK1Cli2) attempts to read the ephemeral znode /eph, which should not 
exist.

 # One interleaving matters here: after a leader election, and before the 
newly elected leader (NC1ZK5) crashes, the previously crashed node (NC1ZK3) 
rejoins the cluster and becomes a follower of the new leader (NC1ZK5).

 # The leader (NC1ZK5) then crashes.

 # In the following election, NC1ZK4 is elected as the new leader.

 # NC1ZK4 (leader) starts synchronizing with the followers in the cluster, 
and it uses DIFF synchronization for each of them:
{quote}From NC1ZK4's log

2025-12-28 22:46:47,815 [myid:4] - INFO 
[LearnerHandler-/172.20.0.2:46746:LearnerHandler@805] - Synchronizing with 
Learner sid: 1 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001

2025-12-28 22:46:47,815 [myid:4] - INFO 
[LearnerHandler-/172.20.0.2:46746:LearnerHandler@850] - Sending DIFF 
zxid=0x300000001 for peer sid: 1

2025-12-28 22:46:47,816 [myid:4] - INFO 
[LearnerHandler-/172.20.0.4:41824:LearnerHandler@805] - Synchronizing with 
Learner sid: 3 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001

2025-12-28 22:46:47,816 [myid:4] - INFO 
[LearnerHandler-/172.20.0.4:41824:LearnerHandler@850] - Sending DIFF 
zxid=0x300000001 for peer sid: 3

2025-12-28 22:46:47,896 [myid:4] - INFO 
[LearnerHandler-/172.20.0.3:59386:LearnerHandler@805] - Synchronizing with 
Learner sid: 2 maxCommittedLog=0x300000001 minCommittedLog=0x100000001 
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001

2025-12-28 22:46:47,896 [myid:4] - INFO 
[LearnerHandler-/172.20.0.3:59386:LearnerHandler@850] - Sending DIFF 
zxid=0x300000001 for peer sid: 2
{quote}
 # After synchronization:

 ** Some followers still contain the ephemeral znode /eph

 ** The leader and other servers do not contain /eph

At this point, the cluster reaches a stable state with permanent data 
divergence.
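
To make the log lines above easier to read, here is a minimal sketch that decodes the zxids and applies a simplified, zxid-only sync selection. The functions split_zxid and choose_sync are hypothetical names written for this report; they are an assumed approximation of the decision, not the actual LearnerHandler code. The only facts used are that a zxid carries the epoch in its high 32 bits and the counter in its low 32 bits, and the values printed in NC1ZK4's log.

```python
import re

def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter): high 32 bits, low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF

def choose_sync(peer_last, last_processed, min_log, max_log):
    """Simplified, assumed model of zxid-based sync selection.

    It looks only at zxids, never at the contents of the peer's data tree.
    """
    if peer_last == last_processed:
        return "DIFF (empty)"   # peer looks up to date, nothing is replayed
    if peer_last > max_log:
        return "TRUNC"          # peer is ahead of the leader's committed log
    if min_log <= peer_last <= max_log:
        return "DIFF"           # replay committed txns after peer_last
    return "SNAP"               # peer is too far behind the committed log

# Message text copied from NC1ZK4's log for learner sid 1.
LOG_LINE = ("Synchronizing with Learner sid: 1 maxCommittedLog=0x300000001 "
            "minCommittedLog=0x100000001 lastProcessedZxid=0x300000001 "
            "peerLastZxid=0x300000001")

fields = {k: int(v, 16) for k, v in re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", LOG_LINE)}
decision = choose_sync(fields["peerLastZxid"], fields["lastProcessedZxid"],
                       fields["minCommittedLog"], fields["maxCommittedLog"])
print(split_zxid(fields["peerLastZxid"]), decision)
```

Under this model, peerLastZxid equals the leader's lastProcessedZxid (epoch 3, counter 1), so the DIFF carries no transactions. Any extra znode in a follower's in-memory tree is invisible to a comparison that only looks at zxids.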

--------------------------------------------------------------------------------------------------------------------------------

After inspecting four nodes (NC1ZK5, which subsequently failed, was excluded 
from the inspection), we found that the contents of the txn logs of NC1ZK1, 
NC1ZK2 and NC1ZK4 are identical, but NC1ZK1 and NC1ZK4 have different znode 
lists.
{quote}[error]FAV test has failed: "2025-12-28 22:47:03,744 [ZKChecker] - INFO 
- server NC1ZK1:11181 and server NC1ZK4:11181 have different number of 
znodes:[/zookeeper/quota, /bug, /eph, /zookeeper] [/zookeeper/quota, /bug, 
/zookeeper]"
{quote}
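
The comparison reported by the checker can be reproduced with a small sketch. The helper diff_znodes is a hypothetical name used only for illustration, not part of the ZKChecker tool; the two path lists are copied from the error message above.

```python
def diff_znodes(paths_a, paths_b):
    """Return (paths only in a, paths only in b), each sorted."""
    a, b = set(paths_a), set(paths_b)
    return sorted(a - b), sorted(b - a)

# Znode lists as reported for NC1ZK1 and NC1ZK4.
nc1zk1 = ["/zookeeper/quota", "/bug", "/eph", "/zookeeper"]
nc1zk4 = ["/zookeeper/quota", "/bug", "/zookeeper"]

only_on_zk1, only_on_zk4 = diff_znodes(nc1zk1, nc1zk4)
print(only_on_zk1, only_on_zk4)  # prints ['/eph'] []
```

The only difference is /eph, which exists on NC1ZK1 (follower) and is missing on NC1ZK4 (leader).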
----



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
