[
https://issues.apache.org/jira/browse/ZOOKEEPER-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050796#comment-18050796
]
lujingyu commented on ZOOKEEPER-5010:
-------------------------------------
If possible, could someone please help me confirm whether this is a new bug,
distinct from previously reported ones? I would really appreciate it!
> Leader DIFF synchronization does not remove stale ephemeral znodes on
> followers, causing permanent state divergence.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-5010
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5010
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.8.5
> Reporter: lujingyu
> Priority: Major
> Attachments: zookeeper--server-NC1ZK1.txt,
> zookeeper--server-NC1ZK2.txt, zookeeper--server-NC1ZK3.txt,
> zookeeper--server-NC1ZK4.txt, zookeeper--server-NC1ZK5.txt
>
>
> We observed a persistent state divergence in a ZooKeeper cluster where an
> ephemeral znode remains present on some followers but is absent on the
> leader. This divergence persists across leader changes and does not self-heal.
> The issue occurs when the original leader crashes suddenly and the cluster
> holds a new leader election. The new leader synchronizes with all followers
> using DIFF, but the in-memory state of some followers contains ephemeral
> nodes that are semantically invalid and were never part of the leader's
> final committed history.
> --------------------------------------------------------------------------------------------------------------------------------
> The actual test scenario is as follows:
> The cluster has five nodes: NC1ZK1 (172.20.0.2), NC1ZK2 (172.20.0.3),
> NC1ZK3 (172.20.0.4), NC1ZK4 (172.20.0.5), NC1ZK5 (172.20.0.6)
> # Start a ZooKeeper cluster with multiple servers.
> # A client (ZK1Cli) connects to the cluster and performs the following
> operations:
> ** create a persistent znode
> ** update the znode
> ** delete the znode
> ** request creation of an ephemeral znode /eph
> # After the ephemeral znode creation is processed, the current leader
> (NC1ZK3) crashes while deserializing quorum packets:
> {quote}org.apache.zookeeper.server.quorum.QuorumPacket.deserialize$$PHOSPHORTAGGED(QuorumPacket.java:85),
>
> org.apache.jute.BinaryInputArchive.readRecord$$PHOSPHORTAGGED(BinaryInputArchive.java:136),
>
> org.apache.zookeeper.server.quorum.LearnerHandler.run$$PHOSPHORTAGGED(LearnerHandler.java:656),
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java)
> {quote}
> # The client session terminates shortly after.
> # Multiple leader elections occur in the cluster before a new client
> (ZK1Cli2) attempts to read the ephemeral znode /eph, which should not exist.
> # In one notable case, after a leader election and before the newly elected
> leader (NC1ZK5) crashes, the previously crashed node (NC1ZK3) rejoins the
> cluster and becomes a follower of the new leader (NC1ZK5).
> # The leader (NC1ZK5) then crashes.
> # After the next election, NC1ZK4 becomes the new leader.
> # NC1ZK4 (leader) starts synchronizing with the followers in the cluster and
> uses DIFF synchronization:
> {quote}From NC1ZK4's log
> 2025-12-28 22:46:47,815 [myid:4] - INFO
> [LearnerHandler-/172.20.0.2:46746:LearnerHandler@805] - Synchronizing with
> Learner sid: 1 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,815 [myid:4] - INFO
> [LearnerHandler-/172.20.0.2:46746:LearnerHandler@850] - Sending DIFF
> zxid=0x300000001 for peer sid: 1
> 2025-12-28 22:46:47,816 [myid:4] - INFO
> [LearnerHandler-/172.20.0.4:41824:LearnerHandler@805] - Synchronizing with
> Learner sid: 3 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,816 [myid:4] - INFO
> [LearnerHandler-/172.20.0.4:41824:LearnerHandler@850] - Sending DIFF
> zxid=0x300000001 for peer sid: 3
> 2025-12-28 22:46:47,896 [myid:4] - INFO
> [LearnerHandler-/172.20.0.3:59386:LearnerHandler@805] - Synchronizing with
> Learner sid: 2 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
> lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
> 2025-12-28 22:46:47,896 [myid:4] - INFO
> [LearnerHandler-/172.20.0.3:59386:LearnerHandler@850] - Sending DIFF
> zxid=0x300000001 for peer sid: 2
> {quote}
> # After synchronization:
> ** Some followers still contain the ephemeral znode /eph
> ** The leader and other servers do not contain /eph
> At this point, the cluster reaches a stable state with permanent data
> divergence.
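To make the log excerpt above easier to read, here is a minimal sketch of why
a DIFF is plausible in this situation. A zxid packs the epoch into its high 32
bits and a counter into its low 32 bits, and every learner in the log reports
a peerLastZxid equal to the leader's lastProcessedZxid (0x300000001). The
class and method names below are hypothetical, and the decision logic is a
deliberately simplified assumption, not the actual LearnerHandler code.

```java
// Illustrative sketch only. SyncModeSketch and chooseSyncMode are hypothetical
// names; the real decision in ZooKeeper has more cases than shown here.
final class SyncModeSketch {
    enum SyncMode { DIFF, TRUNC, SNAP }

    // High 32 bits of a zxid hold the epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid hold the per-epoch counter.
    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Simplified assumption: the mode is chosen from zxids alone.
    static SyncMode chooseSyncMode(long peerLastZxid, long lastProcessedZxid,
                                   long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid == lastProcessedZxid) {
            return SyncMode.DIFF;   // peer looks up to date: nothing to replay
        }
        if (peerLastZxid > maxCommittedLog) {
            return SyncMode.TRUNC;  // peer is ahead of the leader's committed log
        }
        if (peerLastZxid >= minCommittedLog) {
            return SyncMode.DIFF;   // peer is within the committed log window
        }
        return SyncMode.SNAP;       // peer is too far behind: full snapshot
    }

    public static void main(String[] args) {
        long zxid = 0x300000001L;   // value taken from the NC1ZK4 log above
        SyncMode mode = chooseSyncMode(zxid, zxid, 0x100000001L, 0x300000001L);
        // prints "epoch=3 counter=1 mode=DIFF"
        System.out.println(String.format("epoch=%d counter=%d mode=%s",
                epochOf(zxid), counterOf(zxid), mode));
    }
}
```

In this sketch the decision depends only on zxids, so a follower whose last
zxid matches the leader's is treated as up to date regardless of what its
in-memory data tree contains.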
> --------------------------------------------------------------------------------------------------------------------------------
> After inspecting four nodes (NC1ZK5, which subsequently failed, was
> excluded), we found that the contents of the txn logs of NC1ZK1, NC1ZK2 and
> NC1ZK4 are identical, but NC1ZK1 and NC1ZK4 have different znode lists.
> {quote}[error]FAV test has failed: "2025-12-28 22:47:03,744 [ZKChecker] -
> INFO - server NC1ZK1:11181 and server NC1ZK4:11181 have different number of
> znodes:[/zookeeper/quota, /bug, /eph, /zookeeper] [/zookeeper/quota, /bug,
> /zookeeper]"
> {quote}
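The check reported above amounts to a set difference over the znode paths of
two servers. A minimal sketch of that comparison follows, using the two lists
from the checker output. ZnodeDiffSketch and onlyIn are hypothetical names,
and this is not the ZKChecker implementation, which is not shown in this
report.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: compares two already-collected znode path sets.
final class ZnodeDiffSketch {
    // Returns the paths present in 'left' but missing from 'right', sorted.
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Path lists copied from the checker output above.
        Set<String> nc1zk1 = Set.of("/zookeeper/quota", "/bug", "/eph", "/zookeeper");
        Set<String> nc1zk4 = Set.of("/zookeeper/quota", "/bug", "/zookeeper");

        System.out.println("only on NC1ZK1: " + onlyIn(nc1zk1, nc1zk4)); // [/eph]
        System.out.println("only on NC1ZK4: " + onlyIn(nc1zk4, nc1zk1)); // []
    }
}
```

Running the comparison in both directions shows that /eph is the only path on
which the two servers disagree.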
> ----
--
This message was sent by Atlassian Jira
(v8.20.10#820010)