I also see that 3.8.4 is affected by CVE-2024-47554.

On Tue, Jun 24, 2025 at 10:33 AM arjun s v <[email protected]> wrote:

> Andor,
>
> Thank you for your response.
>
> We are uncertain about the specific conditions triggering the issue,
> making it difficult to predict its occurrence in version 3.8.4.
>
> As multiple critical modules, including Hadoop and Kafka, rely on
> Zookeeper, any disruption could lead to potential data loss on our end.
>
> Identifying the exact root cause or scenario would help us both to either
> fix or mitigate the issue.
>
> I will provide any additional logs if needed.
>
> Could you please clarify the specific issue and confirm if a fix is
> available in version 3.8.4?
>
> On Mon, Jun 23, 2025 at 9:27 PM Andor Molnar <[email protected]> wrote:
>
>> Hi Arjun,
>>
>> Could you please validate the same scenario with the latest stable
>> version, 3.8.4?
>>
>> Andor
>>
>>
>>
>>
>> > On Jun 23, 2025, at 00:09, arjun s v <[email protected]> wrote:
>> >
>> > Following up on the ephemeral node issue,
>> >
>> > I observed that the learner sends ACKs for each packet it receives, but
>> > there seems to be no verification on the leader's side to confirm these
>> > ACKs against the packets sent.
>> > Is there a configuration option that, when enabled, ensures all packet
>> > ACKs, including COMMIT ACKs, are validated?
>> > If packet loss is the reason for this issue, verifying all received ACKs
>> > against the sent packets could help prevent such problems in the future.
>> >
>> > Please advise.
>> >
>> > On Thu, Jun 19, 2025 at 6:35 PM arjun s v <[email protected]> wrote:
>> >
>> >> Team,
>> >>
>> >> I'm investigating an issue where an ephemeral node in ZooKeeper was not
>> >> properly managed after a server rejoined the ensemble. My setup uses
>> >> ZooKeeper 3.9.3. Below is the timeline of events:
>> >>
>> >> Timeline:
>> >>
>> >>  1. An ephemeral node is created.
>> >>  2. This is synced across all servers in the ensemble.
>> >>  3. Follower 'A' goes out of the ensemble due to a connectivity issue.
>> >>  4. Now the client session associated with the ephemeral node
>> >>  disconnects, deleting the ephemeral node across all active servers in
>> >>  the ensemble.
>> >>  5. A new client session is initiated, creating another ephemeral node
>> >>  with the same path.
>> >>  6. This new ephemeral node is synced across all active servers in the
>> >>  ensemble.
>> >>  7. Follower 'A' rejoins the ensemble.
>> >>  8. The leader syncs the latest commits to follower 'A'.
>> >>  9. However, (Ephemeral Node).getEphemeralOwner() does not return the
>> >>  current session's session ID.
>> >>
>> >>
>> >> I couldn't confirm if an old ephemeral node persisted, as the machine
>> >> was restarted, resolving the issue. Debug logs were not enabled, so no
>> >> additional logs are available to confirm the root cause. I suspect
>> >> packet loss during the rejoin may have contributed. Attached are the
>> >> leader-to-follower sync logs.
>> >>
>> >> Could you please advise if there are known issues with ephemeral node
>> >> cleanup during server rejoins, or other scenarios to check? Is this
>> >> likely due to packet loss or synchronization issues?
>> >>
>> >> Thanks in advance!
>> >>
>> >>
>> >> Sync logs:
>> >>
>> >>> Follower:
>> >>>
>> >>> 11:29:25:853
>> >>> org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
>> >>> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
>> >>>
>> >>> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
>> >>> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
>> >>>
>> >>> 11:29:25:859
>> >>> org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
>> >>> Successfully completed the authentication using SASL. server addr:
>> >>> <DOMAIN/IPADDR>:2888, status: SUCCESS
>> >>>
>> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
>> >>> Peer state changed: following - synchronization
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Getting a diff from the leader 0x240025c7cc
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Got zxid 0x240025c7cc expected 0x1
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Learner received NEWLEADER message
>> >>>
>> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>> >>> Peer state changed: following - synchronization - diff
>> >>>
>> >>> 11:29:25:870
>> >>> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
>> >>> Dynamic reconfig is disabled, we don't store the last seen config.
>> >>>
>> >>> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> It took 2ms to persist and commit txns in packetsCommitted. 0
>> >>> outstanding txns left in packetsNotLogged
>> >>>
>> >>> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Set the current epoch to 37
>> >>>
>> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
>> >>> Peer state changed: following - synchronization
>> >>>
>> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Sent NEWLEADER ack to leader with zxid 2500000000
>> >>>
>> >>> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
>> >>> Learner received UPTODATE message
>> >>>
>> >>>
>> >>>
>> >>> Leader:
>> >>>
>> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
>> >>> Follower sid: 2 : info : <ADDR>:2888:3888:participant
>> >>>
>> >>> 11:29:25:868
>> >>> org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
>> >>> On disk txn sync enabled with snapshotSizeFactor 0.33
>> >>>
>> >>> 11:29:25:868
>> >>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>> >>> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc
>> >>> minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc
>> >>> peerLastZxid=0x240025c7c3
>> >>>
>> >>> 11:29:25:869
>> >>> org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
>> >>> Using committedLog for peer sid: 2
>> >>>
>> >>> 11:29:25:870
>> >>> org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
>> >>> Sending DIFF zxid=0x240025c7cc for peer sid: 2
>> >>
>> >>
>>
>>
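For anyone reading the quoted sync logs: the zxids can be decoded by hand, which may help when comparing the leader and follower views. The sketch below is only illustrative and is not taken from the ZooKeeper code base. It relies on the documented zxid layout (high 32 bits are the epoch, low 32 bits are the counter). The class name ZxidInfo and its helpers are made up for this example, and it assumes the "2500000000" in the NEWLEADER ack line is hex printed without the 0x prefix. The window check mirrors what the "Using committedLog" line suggests and is not a full model of the leader's sync decision.

```java
public class ZxidInfo {
    // A zxid is a 64-bit value: high 32 bits = epoch, low 32 bits = counter.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Accepts hex with or without a leading "0x", as both forms appear in the logs.
    static long parse(String hex) {
        String digits = hex.startsWith("0x") ? hex.substring(2) : hex;
        return Long.parseUnsignedLong(digits, 16);
    }

    // True when the peer's last zxid lies inside the leader's in-memory
    // committed-log window [minCommitted, maxCommitted].
    static boolean withinCommittedLog(long peerLast, long minCommitted, long maxCommitted) {
        return Long.compareUnsigned(peerLast, minCommitted) >= 0
            && Long.compareUnsigned(peerLast, maxCommitted) <= 0;
    }

    public static void main(String[] args) {
        // Values copied from the quoted leader and follower logs.
        long maxCommitted = parse("0x240025c7cc");
        long minCommitted = parse("0x240025c5d0");
        long peerLast = parse("0x240025c7c3");
        long newLeaderAck = parse("2500000000"); // assumed to be hex

        System.out.println("leader epoch (from DIFF zxid): " + epoch(maxCommitted));
        System.out.println("epoch in NEWLEADER ack: " + epoch(newLeaderAck));
        System.out.println("peer inside committed-log window: "
            + withinCommittedLog(peerLast, minCommitted, maxCommitted));
        // Only meaningful when both zxids share an epoch, as they do here.
        System.out.println("txns follower was behind: "
            + (counter(maxCommitted) - counter(peerLast)));
    }
}
```

With the values above, the DIFF zxid decodes to epoch 36, the NEWLEADER ack to epoch 37 (matching "Set the current epoch to 37"), and the follower was 9 transactions behind within the same epoch. So, going by these logs alone, the DIFF covered only a small window.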
