Hi Arjun,

Could you please validate the same scenario with the latest stable version, 3.8.4?
Andor

> On Jun 23, 2025, at 00:09, arjun s v <[email protected]> wrote:
>
> Continuation to the Ephemeral node issue,
>
> I observed that the learner sends ACKs for each packet it receives, but
> there seems to be no verification on the leader's side to confirm these
> ACKs against the packets sent.
> Is there a configuration option that, when enabled, ensures all packet
> ACKs, including COMMIT ACKs, are validated?
> If packet loss is the reason for this issue, verifying all received ACKs
> against the sent packets could help prevent such problems in the future.
>
> Please advise.
>
> On Thu, Jun 19, 2025 at 6:35 PM arjun s v <[email protected]> wrote:
>
>> Team,
>>
>> I'm investigating an issue where an ephemeral node in ZooKeeper was not
>> properly managed after a server rejoined the ensemble. My setup uses
>> ZooKeeper 3.9.3. Below is the timeline of events:
>>
>> Timeline:
>>
>> 1. An ephemeral node is created.
>> 2. This is synced across all servers in the ensemble.
>> 3. Follower 'A' goes out of the ensemble due to a connectivity issue.
>> 4. Now the client session associated with the ephemeral node
>> disconnects, deleting the ephemeral node across all active servers in the
>> ensemble.
>> 5. A new client session is initiated, creating another ephemeral node
>> with the same path.
>> 6. This new ephemeral node is synced across all active servers in the
>> ensemble.
>> 7. Follower 'A' rejoins the ensemble.
>> 8. The leader syncs the latest commits to follower 'A'.
>> 9. However, (Ephemeral Node).getEphemeralOwner() does not return the
>> current session's session ID.
>>
>>
>> I couldn't confirm if an old ephemeral node persisted, as the machine was
>> restarted, resolving the issue. Debug logs were not enabled, so no
>> additional logs are available to confirm the root cause. I suspect packet
>> loss during the rejoin may have contributed. Attached are the
>> leader-to-follower sync logs.
>>
>> Could you please advise if there are known issues with ephemeral node
>> cleanup during server rejoins, or other scenarios to check? Is this likely
>> due to packet loss or synchronization issues?
>>
>> Thanks in advance!
>>
>>
>> Sync logs:
>>
>>> Follower
>>>
>>> 11:29:25:853
>>> org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
>>> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
>>>
>>> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
>>> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
>>>
>>> 11:29:25:859
>>> org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
>>> Successfully completed the authentication using SASL. server addr:
>>> <DOMAIN/IPADDR>:2888, status: SUCCESS
>>>
>>> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
>>> Peer state changed: following - synchronization
>>>
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Getting a diff from the leader 0x240025c7cc
>>>
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Got zxid 0x240025c7cc expected 0x1
>>>
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Learner received NEWLEADER message
>>>
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>>> Peer state changed: following - synchronization - diff
>>>
>>> 11:29:25:870
>>> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
>>> Dynamic reconfig is disabled, we don't store the last seen config.
>>>
>>> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> It took 2ms to persist and commit txns in packetsCommitted. 0 outstanding
>>> txns left in packetsNotLogged
>>>
>>> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Set the current epoch to 37
>>>
>>> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>>> Peer state changed: following - synchronization
>>>
>>> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Sent NEWLEADER ack to leader with zxid 2500000000
>>>
>>> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Learner received UPTODATE message
>>>
>>>
>>>
>>> Leader:
>>>
>>> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
>>> Follower sid: 2 : info : <ADDR>:2888:3888:participant
>>>
>>> 11:29:25:868 org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
>>> On disk txn sync enabled with snapshotSizeFactor 0.33
>>>
>>> 11:29:25:868
>>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>>> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc
>>> minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc
>>> peerLastZxid=0x240025c7c3
>>>
>>> 11:29:25:869
>>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>>> Using committedLog for peer sid: 2
>>>
>>> 11:29:25:870
>>> org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
>>> Sending DIFF zxid=0x240025c7cc for peer sid: 2
>>
>>
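
For anyone reading the zxids in the sync logs above: a ZooKeeper zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a per-epoch counter. The sketch below only decodes the values quoted in the logs, so it is easier to see how far behind the follower was and why a DIFF from the committed log is a plausible choice. ZxidParts and its method names are illustrative and are not ZooKeeper classes; the range check is a simplified reading of the log lines, not the leader's actual sync logic.

```java
// Sketch only: splits a zxid into epoch and counter, assuming the usual
// layout (high 32 bits = epoch, low 32 bits = counter).
// ZxidParts is a hypothetical helper, not part of ZooKeeper.
final class ZxidParts {

    // Epoch is stored in the upper 32 bits.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Counter is stored in the lower 32 bits.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Accepts hex with or without a leading "0x", as both forms occur in the logs.
    static long parseHex(String text) {
        String digits = text.startsWith("0x") ? text.substring(2) : text;
        return Long.parseUnsignedLong(digits, 16);
    }

    // Simplified assumption: a peer whose last zxid lies inside
    // [minCommittedLog, maxCommittedLog] can be caught up from the committed log.
    static boolean inCommittedLog(long peerLast, long min, long max) {
        return peerLast >= min && peerLast <= max;
    }

    public static void main(String[] args) {
        long max = parseHex("0x240025c7cc");      // maxCommittedLog (leader)
        long min = parseHex("0x240025c5d0");      // minCommittedLog (leader)
        long peer = parseHex("0x240025c7c3");     // peerLastZxid (follower)
        long ack = parseHex("2500000000");        // NEWLEADER ack zxid (follower)

        System.out.println("leader epoch   = " + epoch(max));
        System.out.println("follower epoch = " + epoch(peer));
        System.out.println("txns behind    = " + (counter(max) - counter(peer)));
        System.out.println("in committed log range = " + inCommittedLog(peer, min, max));
        System.out.println("ack epoch = " + epoch(ack) + ", ack counter = " + counter(ack));
    }
}
```

With the values from the logs, the follower and the leader's last committed zxid are both in epoch 36 and differ by 9 in the counter, and the NEWLEADER ack decodes to epoch 37 with counter 0, which agrees with the "Set the current epoch to 37" line. None of this confirms or rules out packet loss; it only makes the logged numbers easier to compare.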
