I can also see that 3.8.4 is affected by CVE-2024-47554.

On Tue, Jun 24, 2025 at 10:33 AM arjun s v <[email protected]> wrote:
> Andor,
>
> Thank you for your response.
>
> We are uncertain about the specific conditions triggering the issue,
> making it difficult to predict its occurrence in version 3.8.4.
>
> As multiple critical modules, including Hadoop and Kafka, rely on
> Zookeeper, any disruption could lead to potential data loss on our end.
>
> Identifying the exact root cause or scenario would help us both to either
> fix or mitigate the issue.
>
> I will provide any additional logs if needed.
>
> Could you please clarify the specific issue and confirm if a fix is
> available in version 3.8.4?
>
> On Mon, Jun 23, 2025 at 9:27 PM Andor Molnar <[email protected]> wrote:
>
>> Hi Arjun,
>>
>> Could you please validate the same scenario with latest stable version
>> 3.8.4?
>>
>> Andor
>>
>> > On Jun 23, 2025, at 00:09, arjun s v <[email protected]> wrote:
>> >
>> > Continuation to the Ephemeral node issue,
>> >
>> > I observed that the learner sends ACKs for each packet it receives, but
>> > there seems to be no verification on the leader's side to confirm these
>> > ACKs against the packets sent.
>> > Is there a configuration option that, when enabled, ensures all packet
>> > ACKs, including COMMIT ACKs, are validated?
>> > If packet loss is the reason for this issue, verifying all received ACKs
>> > against the sent packets could help prevent such problems in the future.
>> >
>> > Please advise.
>> >
>> > On Thu, Jun 19, 2025 at 6:35 PM arjun s v <[email protected]> wrote:
>> >
>> >> Team,
>> >>
>> >> I'm investigating an issue where an ephemeral node in ZooKeeper was not
>> >> properly managed after a server rejoined the ensemble. My setup uses
>> >> ZooKeeper 3.9.3. Below is the timeline of events:
>> >>
>> >> Timeline:
>> >>
>> >> 1. An ephemeral node is created.
>> >> 2. This is synced across all servers in the ensemble.
>> >> 3. Follower 'A' goes out of the ensemble due to a connectivity issue.
>> >> 4. Now the client session associated with the ephemeral node
>> >>    disconnects, deleting the ephemeral node across all active servers in
>> >>    the ensemble.
>> >> 5. A new client session is initiated, creating another ephemeral node
>> >>    with the same path.
>> >> 6. This new ephemeral node is synced across all active servers in the
>> >>    ensemble.
>> >> 7. Follower 'A' rejoins the ensemble.
>> >> 8. The leader syncs the latest commits to follower 'A'.
>> >> 9. However, (Ephemeral Node).getEphemeralOwner() does not return the
>> >>    current session's session ID.
>> >>
>> >> I couldn't confirm if an old ephemeral node persisted, as the machine was
>> >> restarted, resolving the issue. Debug logs were not enabled, so no
>> >> additional logs are available to confirm the root cause. I suspect packet
>> >> loss during the rejoin may have contributed. Attached are the
>> >> leader-to-follower sync logs.
>> >>
>> >> Could you please advise if there are known issues with ephemeral node
>> >> cleanup during server rejoins, or other scenarios to check? Is this likely
>> >> due to packet loss or synchronization issues?
>> >>
>> >> Thanks in advance!
>> >>
>> >> Sync logs:
>> >>
>> >>> Follower
>> >>>
>> >>> 11:29:25:853 org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
>> >>> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
>> >>>
>> >>> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
>> >>> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
>> >>>
>> >>> 11:29:25:859 org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
>> >>> Successfully completed the authentication using SASL. server addr:
>> >>> <DOMAIN/IPADDR>:2888, status: SUCCESS
>> >>>
>> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
>> >>> Peer state changed: following - synchronization
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Getting a diff from the leader 0x240025c7cc
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Got zxid 0x240025c7cc expected 0x1
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Learner received NEWLEADER message
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>> >>> Peer state changed: following - synchronization - diff
>> >>>
>> >>> 11:29:25:870 org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
>> >>> Dynamic reconfig is disabled, we don't store the last seen config.
>> >>>
>> >>> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> It took 2ms to persist and commit txns in packetsCommitted. 0 outstanding
>> >>> txns left in packetsNotLogged
>> >>>
>> >>> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Set the current epoch to 37
>> >>>
>> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>> >>> Peer state changed: following - synchronization
>> >>>
>> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Sent NEWLEADER ack to leader with zxid 2500000000
>> >>>
>> >>> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Learner received UPTODATE message
>> >>>
>> >>> Leader:
>> >>>
>> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
>> >>> Follower sid: 2 : info : <ADDR>:2888:3888:participant
>> >>>
>> >>> 11:29:25:868 org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
>> >>> On disk txn sync enabled with snapshotSizeFactor 0.33
>> >>>
>> >>> 11:29:25:868 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>> >>> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc
>> >>> minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc
>> >>> peerLastZxid=0x240025c7c3
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>> >>> Using committedLog for peer sid: 2
>> >>>
>> >>> 11:29:25:870 org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
>> >>> Sending DIFF zxid=0x240025c7cc for peer sid: 2
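
For anyone reading the sync logs above, here is a rough Python sketch of how the zxid values can be decoded. A ZooKeeper zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The helper names (`split_zxid`, `within_committed_log`) are hypothetical, the range check is a simplified stand-in for the leader's decision to use its committed log, and treating the NEWLEADER ack value `2500000000` as hex is an assumption (it does match the "Set the current epoch to 37" line).

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter): high and low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def within_committed_log(peer_last: int, min_committed: int, max_committed: int) -> bool:
    """Simplified check (an assumption, not the server's full logic):
    the follower's last zxid falls inside the leader's committed-log window."""
    return min_committed <= peer_last <= max_committed


# Values copied from the leader's "Synchronizing with Learner" log line.
max_committed_log = 0x240025C7CC
min_committed_log = 0x240025C5D0
peer_last_zxid = 0x240025C7C3

# Logged without a 0x prefix; assumed to be hexadecimal.
newleader_ack_zxid = 0x2500000000

peer_epoch, peer_counter = split_zxid(peer_last_zxid)
leader_epoch, leader_counter = split_zxid(max_committed_log)
ack_epoch, ack_counter = split_zxid(newleader_ack_zxid)

# Only meaningful because both zxids are in the same epoch.
txns_behind = leader_counter - peer_counter

print("follower epoch:", peer_epoch)
print("leader epoch before NEWLEADER:", leader_epoch)
print("transactions the DIFF has to cover:", txns_behind)
print("epoch after NEWLEADER ack:", ack_epoch)
print("peer inside committed-log window:",
      within_committed_log(peer_last_zxid, min_committed_log, max_committed_log))
```

Read this way, the logs say the follower was 9 transactions behind within epoch 36 and was brought up to date with a DIFF before moving to epoch 37. If this sync is the rejoin from step 7 of the timeline, the delete and re-create of the ephemeral node would be expected to fall inside that 9-transaction range, which may help narrow down where to look in the transaction logs.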
