lujingyu created ZOOKEEPER-5010:
-----------------------------------
Summary: Leader DIFF synchronization does not remove stale
ephemeral znodes on followers, causing permanent state divergence.
Key: ZOOKEEPER-5010
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5010
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.8.5
Reporter: lujingyu
Attachments: zookeeper--server-NC1ZK1.txt,
zookeeper--server-NC1ZK2.txt, zookeeper--server-NC1ZK3.txt,
zookeeper--server-NC1ZK4.txt, zookeeper--server-NC1ZK5.txt
We observed a persistent state divergence in a ZooKeeper cluster where an
ephemeral znode remains present on some followers but is absent on the leader.
This divergence persists across leader changes and does not self-heal.
The issue occurs when the original leader crashes suddenly and the cluster
holds a new leader election. The new leader synchronizes with all followers
using DIFF, but the in-memory state of some followers contains ephemeral
nodes that are semantically invalid and were never part of the leader's final
committed history.
--------------------------------------------------------------------------------------------------------------------------------
The actual test scenario is as follows:
The cluster has five nodes: NC1ZK1(172.20.0.2), NC1ZK2(172.20.0.3),
NC1ZK3(172.20.0.4), NC1ZK4(172.20.0.5), NC1ZK5(172.20.0.6)
# Start a ZooKeeper cluster with multiple servers.
# A client(ZK1Cli) connects to the cluster and performs the following
operations:
** create a persistent znode
** update the znode
** delete the znode
** request creation of an ephemeral znode /eph
# After the ephemeral znode creation is processed, the current
leader(NC1ZK3) crashes while deserializing a quorum packet:
{quote}org.apache.zookeeper.server.quorum.QuorumPacket.deserialize$$PHOSPHORTAGGED(QuorumPacket.java:85),
org.apache.jute.BinaryInputArchive.readRecord$$PHOSPHORTAGGED(BinaryInputArchive.java:136),
org.apache.zookeeper.server.quorum.LearnerHandler.run$$PHOSPHORTAGGED(LearnerHandler.java:656),
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java)
{quote}
# The client session terminates shortly after.
# Multiple leader elections occur in the cluster before a new
client(ZK1Cli2) attempts to read the ephemeral znode /eph, which should not
exist.
# In one special case, after a leader election and before the newly elected
leader(NC1ZK5) crashes, the previously crashed node(NC1ZK3) rejoins the
cluster and becomes a follower of the new leader(NC1ZK5).
# The leader(NC1ZK5) then crashes.
# In the following election, NC1ZK4 is elected as the new leader.
# NC1ZK4(leader) starts synchronizing with the followers in the cluster, and
uses DIFF synchronization for each of them:
{quote}From NC1ZK4's log
2025-12-28 22:46:47,815 [myid:4] - INFO
[LearnerHandler-/172.20.0.2:46746:LearnerHandler@805] - Synchronizing with
Learner sid: 1 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
2025-12-28 22:46:47,815 [myid:4] - INFO
[LearnerHandler-/172.20.0.2:46746:LearnerHandler@850] - Sending DIFF
zxid=0x300000001 for peer sid: 1
2025-12-28 22:46:47,816 [myid:4] - INFO
[LearnerHandler-/172.20.0.4:41824:LearnerHandler@805] - Synchronizing with
Learner sid: 3 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
2025-12-28 22:46:47,816 [myid:4] - INFO
[LearnerHandler-/172.20.0.4:41824:LearnerHandler@850] - Sending DIFF
zxid=0x300000001 for peer sid: 3
2025-12-28 22:46:47,896 [myid:4] - INFO
[LearnerHandler-/172.20.0.3:59386:LearnerHandler@805] - Synchronizing with
Learner sid: 2 maxCommittedLog=0x300000001 minCommittedLog=0x100000001
lastProcessedZxid=0x300000001 peerLastZxid=0x300000001
2025-12-28 22:46:47,896 [myid:4] - INFO
[LearnerHandler-/172.20.0.3:59386:LearnerHandler@850] - Sending DIFF
zxid=0x300000001 for peer sid: 2
{quote}
# After synchronization:
** Some followers still contain the ephemeral znode /eph
** The leader and other servers do not contain /eph
At this point, the cluster reaches a stable state with permanent data
divergence.
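The log lines from NC1ZK4 show the same value, 0x300000001, for both lastProcessedZxid and peerLastZxid on every learner. A zxid is a 64-bit number with the epoch in the high 32 bits and a counter in the low 32 bits. The sketch below decodes that value and models, in a deliberately simplified way, how a decision based only on zxid comparison ends in DIFF when the numbers match, regardless of what the follower's in-memory tree holds. The class ZxidSketch, the enum SyncMode and the method chooseSyncMode are hypothetical names for illustration; this is not the actual LearnerHandler code.

```java
// Hypothetical illustration only; not taken from the ZooKeeper code base.
final class ZxidSketch {

    enum SyncMode { DIFF, TRUNC, SNAP }

    // The epoch is stored in the high 32 bits of a zxid.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // The per-epoch counter is stored in the low 32 bits of a zxid.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Simplified model (assumption): the mode depends only on how the
    // learner's last zxid compares with the leader's zxid range.
    static SyncMode chooseSyncMode(long peerLastZxid,
                                   long minCommittedLog,
                                   long maxCommittedLog,
                                   long lastProcessedZxid) {
        if (peerLastZxid == lastProcessedZxid) {
            return SyncMode.DIFF;   // zxids match: nothing further to send
        }
        if (peerLastZxid > maxCommittedLog) {
            return SyncMode.TRUNC;  // learner is ahead of the leader
        }
        if (peerLastZxid >= minCommittedLog) {
            return SyncMode.DIFF;   // missing txns are in the committed log
        }
        return SyncMode.SNAP;       // learner is too far behind
    }

    public static void main(String[] args) {
        long zxid = 0x300000001L;   // value reported in NC1ZK4's log
        System.out.println("epoch=" + epoch(zxid) + " counter=" + counter(zxid));
        System.out.println(chooseSyncMode(zxid, 0x100000001L, zxid, zxid));
    }
}
```

Under this model, a follower whose last zxid equals the leader's is treated as up to date, so a znode such as /eph that exists only in that follower's memory is never compared or removed.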
--------------------------------------------------------------------------------------------------------------------------------
We inspected four nodes (NC1ZK5, which had subsequently failed, was excluded
from the observation). The contents of the txn logs of NC1ZK1, NC1ZK2 and
NC1ZK4 are identical, but NC1ZK1 and NC1ZK4 have different znode lists.
{quote}[error]FAV test has failed: "2025-12-28 22:47:03,744 [ZKChecker] - INFO
- server NC1ZK1:11181 and server NC1ZK4:11181 have different number of
znodes:[/zookeeper/quota, /bug, /eph, /zookeeper] [/zookeeper/quota, /bug,
/zookeeper]"
{quote}
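The checker output above reports only that the two servers hold different numbers of znodes. A small set difference makes the divergent path explicit. The following is a minimal sketch using the two lists from that message; the class ZnodeDiff and the method onlyIn are hypothetical names and are not part of ZKChecker or ZooKeeper.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of ZKChecker or ZooKeeper.
final class ZnodeDiff {

    // Returns the paths present in 'left' but absent from 'right', sorted.
    static Set<String> onlyIn(Collection<String> left, Collection<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Znode lists copied from the checker message.
        List<String> nc1zk1 = List.of("/zookeeper/quota", "/bug", "/eph", "/zookeeper");
        List<String> nc1zk4 = List.of("/zookeeper/quota", "/bug", "/zookeeper");

        System.out.println("only on NC1ZK1: " + onlyIn(nc1zk1, nc1zk4));
        System.out.println("only on NC1ZK4: " + onlyIn(nc1zk4, nc1zk1));
    }
}
```

For the two lists in the message, the only path found on NC1ZK1 and not on NC1ZK4 is /eph, and no path is exclusive to NC1ZK4.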
----
--
This message was sent by Atlassian Jira
(v8.20.10#820010)