Folks,

        When I grepped the ZK server logs for the session ID, I do see the 
following message at the time the connection was lost and reset:

"Client attempting to renew session"

        So it looks like the issue is indeed that the client reconnected 
and kept the same session. I suspect upgrading will not "fix" this issue, 
because it seems this is behaving as designed. I'll need to do some more 
research to understand how I can tell when a reconnect has triggered a new 
session versus resumed the original session. Checking for the existence of the 
znodes after reconnect won't work, because they could have been deleted and 
recreated by another app instance that has picked up the work on behalf of the 
disconnected instance. If I could see the client's sid and compare the new 
connection's sid to the old, I guess I could assume I'm still the owner of the 
znodes if they exist, but I'm not sure if the sid is exposed anywhere in the 
API (if it is, I haven't found it yet and would appreciate guidance). It would 
also be helpful if the znode's ephemeral owner ID were exposed in the client 
API, but I don't see that anywhere in the WatchedEvent API. I guess another 
possibility is that I have to append some information onto the znode's path 
that identifies the owner, but that would require a major change in our logic 
that would introduce a lot of additional complexity. Right now, each app will 
randomly try to grab work and register that it is handling the work by creating 
a well known path with the work's unique ID. Successful creation of the path 
means no other app is handling the work.
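
        A minimal sketch of the sid comparison described above, assuming both 
IDs can be obtained as longs. The Java client appears to offer 
ZooKeeper.getSessionId() and Stat.getEphemeralOwner() for this, but that is an 
assumption to confirm against 3.4.6. The class, enum, and method names below 
are made up for illustration, and the sketch takes plain values so it runs 
without a ZK server:

```java
import java.util.OptionalLong;

public class OwnershipCheck {
    /** Possible outcomes when re-checking a work znode after a reconnect. */
    enum Ownership { STILL_MINE, TAKEN_BY_OTHER, GONE }

    /**
     * Decide who owns a work znode.
     *
     * @param mySessionId    session ID of this client's current connection
     *                       (assumed to come from ZooKeeper.getSessionId())
     * @param ephemeralOwner owner recorded on the znode, or empty if the znode
     *                       does not exist (assumed to come from
     *                       Stat.getEphemeralOwner() on the result of exists())
     */
    static Ownership check(long mySessionId, OptionalLong ephemeralOwner) {
        if (!ephemeralOwner.isPresent()) {
            // The znode was deleted and nobody has recreated it yet.
            return Ownership.GONE;
        }
        // Same session ID: the original session was resumed and the znode is ours.
        // Different ID: another instance deleted and recreated it on our behalf.
        return ephemeralOwner.getAsLong() == mySessionId
                ? Ownership.STILL_MINE
                : Ownership.TAKEN_BY_OTHER;
    }

    public static void main(String[] args) {
        long mine = 0x100000000000001L;   // hypothetical session IDs
        long other = 0x200000000000002L;
        System.out.println(check(mine, OptionalLong.of(mine)));
        System.out.println(check(mine, OptionalLong.of(other)));
        System.out.println(check(mine, OptionalLong.empty()));
    }
}
```

        If this holds up, existence of the znode alone is never trusted; only 
a match on the owner ID counts as "still mine".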

        If there is an easier way of managing all of this, please let me know. 
The point of using ZooKeeper was to delegate all the messiness of managing a 
distributed system, but if I have to have complicated logic to sense 
disconnects and then check for the existence of ephemeral znodes after a 
reconnect to know whether I'm still the owner of shared work, that isn't 
terribly helpful. Hopefully, I'm missing something obvious that makes this much 
easier.
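
        One possible way to tell a resumed session from a lost one is the 
connection state delivered to the client's watcher. The sketch below assumes 
three states that mirror the names in Watcher.Event.KeeperState (the enum here 
is a local stand-in, not the ZooKeeper type) and uses hypothetical action 
names. The assumption is that Disconnected followed by SyncConnected means the 
same session was resumed and the ephemeral znodes are still held, while 
Expired means the server dropped the session and a new client handle is needed:

```java
public class SessionStateSketch {
    /** Local stand-in for three Watcher.Event.KeeperState values. */
    enum State { Disconnected, SyncConnected, Expired }

    /** Hypothetical reactions an app instance could take. */
    enum Action { PAUSE_WORK, KEEP_WORK, DROP_WORK_AND_RECREATE_CLIENT }

    /** Map a connection state change to what the app should do with its work. */
    static Action react(State state) {
        switch (state) {
            case Disconnected:
                // Ownership is uncertain until we either reconnect or expire.
                return Action.PAUSE_WORK;
            case SyncConnected:
                // Reconnected within the session timeout: same session, znodes kept.
                return Action.KEEP_WORK;
            case Expired:
                // Session is gone on the server, so the ephemeral znodes are too.
                return Action.DROP_WORK_AND_RECREATE_CLIENT;
            default:
                throw new IllegalArgumentException("unhandled state: " + state);
        }
    }

    public static void main(String[] args) {
        // The sequence a long GC pause followed by a quick reconnect would produce.
        State[] gcPause = { State.Disconnected, State.SyncConnected };
        for (State s : gcPause) {
            System.out.println(s + " -> " + react(s));
        }
    }
}
```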

Paul

> On Oct 30, 2020, at 1:10 PM, Paul Summermatter <[email protected]> wrote:
> 
> Enrico,
> 
>       Thank you very much for the incredibly rapid reply. I just discovered 
> that I can indeed correlate the ephemeral owner ID with a session's "sid" 
> using the 'cons' command. I discovered that one of the three ZK instances 
> thinks there is a session with that ID.
> 
>       Do you or anyone else happen to know if ZK has any issues (either in 
> the current or older versions) where a session will not be terminated if the 
> client reconnects within a relatively short period of time? I don't know how 
> exactly ZK identifies a session or whether the ZK client is trying to be 
> helpful and attempts to maintain the session when it reconnects by providing 
> the prior session ID in the new connection request, preventing the ephemeral 
> nodes from being deleted as I want/expect.
> 
>       Given our lengthy testing cycle and the fact that we're nearing the 
> holidays, upgrading ZK won't be possible until next year, but we will 
> definitely look into it. My only concern is that, if this is ZK's expected 
> behavior for some reason, upgrading won't solve the issue.
> 
> Regards,
> Paul
> 
>> On Oct 30, 2020, at 12:34 PM, Enrico Olivelli <[email protected]> wrote:
>> 
>> Paul,
>> Do you have a way to upgrade to the latest ZK 3.6.2?
>> Many things have changed since 3.4.6; it is a pretty old version.
>> 
>> The session is declared as "expired" on the server side, and this will in
>> turn trigger the deletion of the ephemeral nodes. If they aren't deleted,
>> either the session is still active from the server's point of view or there
>> is some kind of bug.
>> 
>> Enrico
>> 
>> 
>> On Fri, Oct 30, 2020 at 17:30, Paul Summermatter
>> <[email protected]> wrote:
>> 
>>> RE: ZooKeeper 3.4.6
>>> 
>>> All,
>>> 
>>>       I'm trying to troubleshoot a problem and could use some guidance
>>> from the experts on ZK administration. I have a cluster of applications
>>> that share work and that create ephemeral nodes representing the work in ZK
>>> expressly so that, if one application fails, the ephemeral nodes should be
>>> deleted, and the other apps should be able to pick up the work that is now
>>> not being completed by the failed instance.
>>> 
>>>       Yesterday evening, one application instance suffered from some
>>> severe memory pressure and had to run multiple stop-the-world GC cycles.
>>> The pauses appear to have triggered a SessionExpiredException in
>>> org.apache.zookeeper.ClientCnxn$SendThread.run (I correlated multiple
>>> "Pause Full" statements in the GC logs with the ZK session timeout in the
>>> application logs). After the timeout, the connection was re-established in
>>> under 1,000ms, but the ephemeral nodes remained in ZooKeeper, leaving them
>>> as orphans. We've seen this behavior before and have had to delete the
>>> nodes manually using the zkCli.sh utility.
>>> 
>>>       In an attempt to troubleshoot this issue, I'm trying to correlate
>>> the ephemeral owner that is listed on a node when you run the 'get' command
>>> with the ID of an active session. Basically, I'm trying to understand
>>> whether ZK thinks there is still an active session associated with the
>>> ephemeral node. That might explain why the ZK server didn't seem to
>>> recognize the session timeout that the client sensed (and that triggered
>>> a new connection), and why the ephemeral nodes were not deleted as they
>>> should have been when the connection dropped.
>>> 
>>>       I've tried the various four-letter commands on the server to see
>>> if any of them output anything that looks like the ephemeral owner ID
>>> without any success. Any suggestions/guidance would be greatly appreciated.
>>> Note, right now, upgrading is not an option, but I'm certainly open to that
>>> if there are known issues with ephemeral nodes in 3.4 that are addressed in
>>> newer versions.
>>> 
>>> Regards,
>>> Paul
> 
