Paul,
do you have a way to upgrade to the latest ZK, 3.6.2? Many things have changed since 3.4.6, which is a pretty old version.
The session is declared as "expired" on the server side, and this in turn triggers the deletion of the ephemeral nodes. If they aren't deleted, either the session is still active from the server's point of view or there is some kind of bug.

Enrico

On Fri, Oct 30, 2020 at 17:30, Paul Summermatter <[email protected]> wrote:

> RE: ZooKeeper 3.4.6
>
> All,
>
> I'm trying to troubleshoot a problem and could use some guidance
> from the experts on ZK administration. I have a cluster of applications
> that share work and that create ephemeral nodes representing the work in
> ZK expressly so that, if one application fails, the ephemeral nodes
> should be deleted, and the other apps should be able to pick up the work
> that is now not being completed by the failed instance.
>
> Yesterday evening, one application instance suffered from some
> severe memory pressure and had to run multiple stop-the-world GC cycles.
> The pauses appear to have triggered a SessionExpiredException in
> org.apache.zookeeper.ClientCnxn$SendThread.run (I correlated multiple
> "Pause Full" statements in the GC logs with the ZK session timeout in
> the application logs). After the timeout, the connection was
> re-established in under 1,000 ms, but the ephemeral nodes remained in
> ZooKeeper, leaving them as orphans. We've seen this behavior before and
> have had to delete the nodes manually using the zkCli.sh utility.
>
> In an attempt to troubleshoot this issue, I'm trying to correlate
> the ephemeral owner that is listed on a node when you run the 'get'
> command with the ID of an active session. Basically, I'm trying to
> understand whether ZK thinks there is still an active session associated
> with the ephemeral node, in the hope that this might explain why the ZK
> server didn't seem to recognize the session timeout that the client
> sensed and that triggered a new connection, and why the ephemeral nodes
> were not deleted as they should have been when the connection dropped.
>
> I've tried the various four-letter commands on the server to see
> if any of them output anything that looks like the ephemeral owner ID,
> without any success. Any suggestions/guidance would be greatly
> appreciated.
>
> Note, right now, upgrading is not an option, but I'm certainly open to
> that if there are known issues with ephemeral nodes in 3.4 that are
> addressed in newer versions.
>
> Regards,
> Paul
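
For the correlation Paul describes, one possible approach is to compare the ephemeralOwner value printed by 'get' with the session IDs listed by the 'dump' four-letter command, which lists outstanding sessions and ephemeral nodes and only works on the leader. The following is a rough sketch, not a tested tool: the helper names are made up for illustration, the host and port are placeholders, and the parser assumes that the dump output contains a "Sessions with Ephemerals" section with one hex session ID per line followed by the paths that session owns. Check that assumption against the real output of your 3.4.6 server.

```python
import re
import socket


def four_letter(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. 'dump') and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def ephemerals_by_session(dump_text):
    """Map session id (int) -> ephemeral paths, from assumed 'dump' layout."""
    sessions = {}
    current = None
    in_section = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Sessions with Ephemerals"):
            in_section = True
            continue
        if not in_section or not stripped:
            continue
        match = re.fullmatch(r"0x([0-9a-fA-F]+):", stripped)
        if match:
            current = int(match.group(1), 16)
            sessions[current] = []
        elif current is not None and stripped.startswith("/"):
            sessions[current].append(stripped)
    return sessions


def owner_id(stat_line):
    """Parse an 'ephemeralOwner = 0x...' line from zkCli.sh into an int."""
    return int(stat_line.split("=", 1)[1].strip(), 16)
```

Usage would look something like ephemerals_by_session(four_letter("zk-leader.example", 2181, "dump")), then checking whether owner_id("ephemeralOwner = 0x...") is one of the keys. If the owner is still listed, the server still considers that session alive.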
