>> but I'm not sure if the sid is exposed anywhere in the API (if it is, I haven't found it yet and would appreciate guidance).
The session id can be retrieved through the Stat object passed to various ZooKeeper APIs (like getData): once you have a Stat object, calling getEphemeralOwner returns the id of the session that owns the node. Alternatively, as Raúl pointed out, zk-shell is an excellent tool to obtain the same information.

I'd also echo what Enrico pointed out on the version upgrades: we have had quite a few bugs related to ephemeral nodes, and you could be hitting one of them in your case.

On Fri, Oct 30, 2020 at 10:46 AM Paul Summermatter <[email protected]> wrote:

> Folks,
>
> When I grep'd the ZK server logs for the session ID, I do see at
> the time that the connection was lost and reset the following message:
>
> "Client attempting to renew session"
>
> So it looks like this is indeed the issue that the client
> reconnected and kept the same session. I suspect upgrading will not "fix"
> this issue, because it seems this is behaving as designed. I'll need to do
> some more research to understand how I can tell when a reconnect has
> triggered a new session versus resumed the original session. Checking for
> the existence of the znodes after reconnect won't work, because they could
> have been deleted and recreated by another app instance that has picked up
> the work on behalf of the disconnected instance. If I could see the
> client's sid and compare the new connection's sid to the old, I guess I
> could assume I'm still the owner of the znodes if they exist, but I'm not
> sure if the sid is exposed anywhere in the API (if it is, I haven't found
> it yet and would appreciate guidance). It would also be helpful if the
> znode's ephemeral owner ID were exposed in the client API, but I don't see
> that anywhere in the WatchedEvent API. I guess another possibility is that
> I have to append some information onto the znode's path that identifies the
> owner, but that would require a major change in our logic that would
> introduce a lot of additional complexity.
> Right now, each app will randomly
> try to grab work and register that it is handling the work by creating a
> well known path with the work's unique ID. Successful creation of the path
> means no other app is handling the work.
>
> If there is an easier way of managing all of this, please let me
> know. The point of using ZooKeeper was to delegate all the messiness of
> managing a distributed system, but if I have to have complicated logic to
> sense disconnects and then check for the existence of ephemeral znodes
> after a reconnect to know whether I'm still the owner of shared work, that
> isn't terribly helpful. Hopefully, I'm missing something obvious that makes
> this much easier.
>
> Paul
>
>
> On Oct 30, 2020, at 1:10 PM, Paul Summermatter <[email protected]> wrote:
> >
> > Enrico,
> >
> > Thank you very much for the incredibly rapid reply. I just
> > discovered that I can indeed correlate the ephemeral owner ID with a
> > session's "sid" using the 'cons' command. I discovered that one of the three
> > ZK instances thinks there is a session with that ID.
> >
> > Do you or anyone else happen to know if ZK has any issues (either
> > in the current or older versions) where a session will not be terminated if
> > the client reconnects within a relatively short period of time? I don't
> > know how exactly ZK identifies a session or whether the ZK client is trying
> > to be helpful and attempts to maintain the session when it reconnects by
> > providing the prior session ID in the new connection request, preventing
> > the ephemeral nodes from being deleted as I want/expect.
> >
> > Given our lengthy testing cycle and the fact that we're nearing
> > the holidays, upgrading ZK won't be possible until next year, but we will
> > definitely look into it. My only concern is if this is ZK's expected
> > behavior for some reason, upgrading won't solve the issue.
> >
> > Regards,
> > Paul
> >
> >> On Oct 30, 2020, at 12:34 PM, Enrico Olivelli <[email protected]> wrote:
> >>
> >> Paul,
> >> do you have a way to upgrade to the latest ZK 3.6.2?
> >> Many things have changed since 3.4.6; it is a pretty old version.
> >>
> >> The session is declared as "expired" on the server side, and this will in
> >> turn trigger the deletion of the ephemeral nodes. If they aren't deleted,
> >> the session is still active from the server's point of view or there is some
> >> kind of bug.
> >>
> >> Enrico
> >>
> >>
> >> On Fri, Oct 30, 2020 at 17:30, Paul Summermatter
> >> <[email protected]> wrote:
> >>
> >>> RE: ZooKeeper 3.4.6
> >>>
> >>> All,
> >>>
> >>> I'm trying to troubleshoot a problem and could use some guidance
> >>> from the experts on ZK administration. I have a cluster of applications
> >>> that share work and that create ephemeral nodes representing the work in ZK
> >>> expressly so that, if one application fails, the ephemeral nodes should be
> >>> deleted, and the other apps should be able to pick up the work that is now
> >>> not being completed by the failed instance.
> >>>
> >>> Yesterday evening, one application instance suffered from some
> >>> severe memory pressure and had to run multiple stop-the-world GC cycles.
> >>> The pauses appear to have triggered a SessionExpiredException in
> >>> org.apache.zookeeper.ClientCnxn$SendThread.run (I correlated multiple
> >>> "Pause Full" statements in the GC logs with the ZK session timeout in the
> >>> application logs). After the timeout, the connection was re-established in
> >>> under 1,000ms, but the ephemeral nodes remained in ZooKeeper, leaving them
> >>> as orphans. We've seen this behavior before and have had to delete the
> >>> nodes manually using the zkCli.sh utility.
> >>>
> >>> In an attempt to troubleshoot this issue, I'm trying to correlate
> >>> the ephemeral owner that is listed on a node when you run the 'get' command
> >>> with the ID of an active session. Basically, I'm trying to understand
> >>> whether ZK thinks there is still an active session associated with the
> >>> ephemeral node in the hopes that that might lead to an explanation for why
> >>> the ZK server didn't seem to recognize the session timeout sensed on the
> >>> client that triggered a new connection and would explain why the ephemeral
> >>> nodes were not deleted as they should have been when the connection dropped.
> >>>
> >>> I've tried the various four letter commands on the server to see
> >>> if any of them output anything that looks like the ephemeral owner ID
> >>> without any success. Any suggestions/guidance would be greatly appreciated.
> >>> Note, right now, upgrading is not an option, but I'm certainly open to that
> >>> if there are known issues with ephemeral nodes in 3.4 that are addressed in
> >>> newer versions.
> >>>
> >>> Regards,
> >>> Paul
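
For anyone following the thread, the check suggested in the reply at the top (compare a znode's ephemeral owner with the client's own session id) can be sketched in plain Java. This is only a minimal sketch under stated assumptions, not ZooKeeper API: the class EphemeralOwnership, the Owner enum, and both methods are hypothetical names made up for illustration. In a real client the owner id would come from Stat.getEphemeralOwner() and the client's session id from ZooKeeper.getSessionId(); here both are passed in as plain long values so the logic can be read and tested on its own.

```java
import java.util.OptionalLong;

/**
 * Illustrative helper (hypothetical, not part of ZooKeeper).
 * Inputs in a real client would be:
 *   - Stat.getEphemeralOwner() from the Stat returned by exists()/getData()
 *   - ZooKeeper.getSessionId() from the current client handle
 */
final class EphemeralOwnership {

    /** Hypothetical result type used only in this sketch. */
    enum Owner { SELF, OTHER_SESSION, NOT_EPHEMERAL, MISSING }

    /**
     * @param ephemeralOwner owner session id read from the znode's Stat,
     *                       or empty if the znode does not exist
     * @param mySessionId    session id of the current client handle
     */
    static Owner classify(OptionalLong ephemeralOwner, long mySessionId) {
        if (!ephemeralOwner.isPresent()) {
            return Owner.MISSING;
        }
        long owner = ephemeralOwner.getAsLong();
        if (owner == 0L) {
            // ZooKeeper reports an ephemeral owner of 0 for non-ephemeral znodes.
            return Owner.NOT_EPHEMERAL;
        }
        return owner == mySessionId ? Owner.SELF : Owner.OTHER_SESSION;
    }

    /** Renders a session id as 0x-prefixed hex, handy when grepping server logs. */
    static String toHex(long sessionId) {
        return "0x" + Long.toHexString(sessionId);
    }
}
```

With the real client, the call would look roughly like classify(OptionalLong.of(stat.getEphemeralOwner()), zk.getSessionId()). A result of SELF would suggest the reconnect resumed the original session and the znode is still yours, while OTHER_SESSION would suggest another instance deleted and recreated it in the meantime.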
