Hi Arjun,

Could you please validate the same scenario with the latest stable version, 3.8.4?

Andor




> On Jun 23, 2025, at 00:09, arjun s v <[email protected]> wrote:
> 
> Following up on the ephemeral node issue:
> 
> I observed that the learner sends ACKs for each packet it receives, but
> there seems to be no verification on the leader's side to confirm these
> ACKs against the packets sent.
> Is there a configuration option that, when enabled, ensures all packet
> ACKs, including COMMIT ACKs, are validated?
> If packet loss is the reason for this issue, verifying all received ACKs
> against the sent packets could help prevent such problems in the future.
> 
> Please advise.
> 
> On Thu, Jun 19, 2025 at 6:35 PM arjun s v <[email protected]> wrote:
> 
>> Team,
>> 
>> I'm investigating an issue where an ephemeral node in ZooKeeper was not
>> properly managed after a server rejoined the ensemble. My setup uses
>> ZooKeeper 3.9.3. Below is the timeline of events:
>> 
>> Timeline:
>> 
>>  1. An ephemeral node is created.
>>  2. This is synced across all servers in the ensemble.
>>  3. Follower 'A' drops out of the ensemble due to a connectivity issue.
>>  4. The client session associated with the ephemeral node then
>>  disconnects, and the ephemeral node is deleted on all active servers in
>>  the ensemble.
>>  5. A new client session is initiated, creating another ephemeral node
>>  with the same path.
>>  6. This new ephemeral node is synced across all active servers in the
>>  ensemble.
>>  7. Follower 'A' rejoins the ensemble.
>>  8. The leader syncs the latest commits to follower 'A'.
>>  9. However, (Ephemeral Node).getEphemeralOwner() does not return the
>>  current session's session ID.
>> 
>> 
>> I couldn't confirm whether the old ephemeral node persisted, because the
>> machine was restarted, which resolved the issue. Debug logs were not
>> enabled, so no additional logs are available to confirm the root cause. I
>> suspect packet loss during the rejoin may have contributed. Attached are
>> the leader-to-follower sync logs.
>> 
>> Could you please advise if there are known issues with ephemeral node
>> cleanup during server rejoins, or other scenarios to check? Is this likely
>> due to packet loss or synchronization issues?
>> 
>> Thanks in advance!
>> 
>> 
>> Sync logs:
>> 
>>> Follower
>>> 
>>> 11:29:25:853
>>> org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
>>> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
>>> 
>>> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
>>> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
>>> 
>>> 11:29:25:859
>>> org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
>>> Successfully completed the authentication using SASL. server addr:
>>> <DOMAIN/IPADDR>:2888, status: SUCCESS
>>> 
>>> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
>>> Peer state changed: following - synchronization
>>> 
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Getting a diff from the leader 0x240025c7cc
>>> 
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Got zxid 0x240025c7cc expected 0x1
>>> 
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Learner received NEWLEADER message
>>> 
>>> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>>> Peer state changed: following - synchronization - diff
>>> 
>>> 11:29:25:870
>>> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
>>> Dynamic reconfig is disabled, we don't store the last seen config.
>>> 
>>> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> It took 2ms to persist and commit txns in packetsCommitted. 0 outstanding
>>> txns left in packetsNotLogged
>>> 
>>> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Set the current epoch to 37
>>> 
>>> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>>> Peer state changed: following - synchronization
>>> 
>>> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Sent NEWLEADER ack to leader with zxid 2500000000
>>> 
>>> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>>> Learner received UPTODATE message
>>> 
>>> 
>>> 
>>> Leader:
>>> 
>>> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
>>> Follower sid: 2 : info : <ADDR>:2888:3888:participant
>>> 
>>> 11:29:25:868 org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
>>> On disk txn sync enabled with snapshotSizeFactor 0.33
>>> 
>>> 11:29:25:868
>>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>>> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc
>>> minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc
>>> peerLastZxid=0x240025c7c3
>>> 
>>> 11:29:25:869
>>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>>> Using committedLog for peer sid: 2
>>> 
>>> 11:29:25:870
>>> org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
>>> Sending DIFF zxid=0x240025c7cc for peer sid: 2
>> 
>> 
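
A note on reading the zxids in the sync logs above: a ZooKeeper zxid is a
64-bit value whose high 32 bits are the epoch and whose low 32 bits are a
per-epoch counter. The sketch below decodes the values from the leader's
syncFollower line and checks that the follower's last zxid falls inside the
leader's committed-log window, which would be consistent with the
"Using committedLog" and "Sending DIFF" messages. The helper name split_zxid
is hypothetical, and treating the NEWLEADER ack value "2500000000" as hex is
an assumption (it matches "Set the current epoch to 37").

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter): high 32 bits, low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF


# Values copied from the leader's syncFollower log line.
max_committed_log = 0x240025C7CC
min_committed_log = 0x240025C5D0
peer_last_zxid = 0x240025C7C3

# Assumption: the follower logs this value in hex without the 0x prefix.
new_leader_ack = int("2500000000", 16)

for name, zxid in [
    ("maxCommittedLog", max_committed_log),
    ("minCommittedLog", min_committed_log),
    ("peerLastZxid", peer_last_zxid),
    ("NEWLEADER ack", new_leader_ack),
]:
    epoch, counter = split_zxid(zxid)
    print(f"{name}: epoch={epoch} counter={counter:#x}")

# The follower's last zxid lies inside the leader's committed-log window.
in_window = min_committed_log <= peer_last_zxid <= max_committed_log
print("peerLastZxid within committedLog window:", in_window)

# Same epoch on both sides, so the counter gap bounds the txns in the DIFF.
gap = max_committed_log - peer_last_zxid
print("zxid gap between follower and leader:", gap)
```

Read this way, the follower was 9 zxids behind the leader within epoch 36
when it rejoined, and the NEWLEADER ack carries epoch 37 with a zero counter.
This only interprets the numbers already in the logs; it does not show which
transactions the DIFF contained.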
