(crossposting to dev@zookeeper) Hi ZooKeepers, can anyone take a look at this problem a user found while using Curator?
Thanks in advance
Enrico

On Tue, Jul 20, 2021 at 09:01 Cameron McKenzie <cammcken...@apache.org> wrote:

> hey Viswa,
> I'm by no means an expert on this chunk of code, but I've done a bit of
> digging and it certainly seems that you've uncovered an issue.
>
> Ultimately, the root cause of the issue is the weirdness in the way that ZK
> handles ephemeral nodes. I'm not sure if this is intentional or a bug, but I
> would have thought that if ephemeral nodes are tied to a session then they
> should be removed as soon as the session has expired.
>
> From the Curator standpoint, it appears that InterProcessMutex has been
> written with the assumption that ephemeral nodes are deleted when their
> session expires. To fix it on the Curator side, I think that we would need
> to provide a way to interrupt the acquire() method, so that when the
> connection goes into a SUSPENDED state we can force a restart of the
> acquisition. I guess you could just explicitly interrupt the thread when
> your ConnectionStateListener gets a SUSPENDED event, but this is a bit
> ugly.
>
> Might be worth raising the issue on the ZK lists to see if this is a bug or
> by design.
>
> Any other devs have any thoughts?
> cheers
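For anyone skimming the thread, here is a minimal sketch of the interrupt-on-SUSPENDED workaround Cameron describes above: the thread blocked in acquire() is interrupted from a ConnectionStateListener so it does not complete against a lock node left over from an expired session. The class name, method name and lock path below are placeholders of mine, not anything Curator defines, and this is only an illustration of where the interrupt would hook in, not a vetted fix.

import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class InterruptibleLockHolder
{
    // Placeholder lock path, not anything Curator defines
    private static final String LOCK_PATH = "/locks/example";

    private final AtomicReference<Thread> acquiringThread = new AtomicReference<>();

    public void doWorkUnderLock(CuratorFramework client) throws Exception
    {
        InterProcessMutex mutex = new InterProcessMutex(client, LOCK_PATH);

        // Interrupt the acquiring/holding thread as soon as the connection is suspended or lost
        ConnectionStateListener listener = (c, newState) -> {
            if ( newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST )
            {
                Thread thread = acquiringThread.get();
                if ( thread != null )
                {
                    thread.interrupt();
                }
            }
        };
        client.getConnectionStateListenable().addListener(listener);

        acquiringThread.set(Thread.currentThread());
        try
        {
            mutex.acquire();    // throws if the listener interrupted us while we were blocked
            // ... protected work here ...
        }
        finally
        {
            acquiringThread.set(null);
            if ( mutex.isAcquiredInThisProcess() )
            {
                mutex.release();
            }
            client.getConnectionStateListenable().removeListener(listener);
        }
    }
}

As Cameron says, this is a bit ugly, and whether the interrupt lands before or after the stale node has been observed is still timing-dependent; it only illustrates the shape of the workaround.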
> On Tue, Jul 20, 2021 at 3:45 AM Viswanathan Rajagopal
> <viswanathan.rajag...@workday.com.invalid> wrote:
>
> > Hi Team,
> >
> > Good day.
> >
> > We recently came across a "double locking" issue (i.e. two clients acquiring
> > the lock) using Curator code (the InterProcessMutex lock APIs) in our
> > application.
> >
> > Our use case:
> >
> >   * Two clients attempt to acquire the ZooKeeper lock using Curator
> >     InterProcessMutex, and whoever owns it releases it once it sees the
> >     connection disconnect (on receiving Connection.SUSPENDED / Connection.LOST
> >     Curator connection events from the connection listener).
> >
> > Issue we noticed:
> >
> >   * After the session expired & the clients reconnected with new sessions,
> >     both clients seem to have acquired the lock. The interesting thing we
> >     found is that one of the clients still held the lock while its
> >     (ephemeral) lock node was gone.
> >
> > Things we found:
> >
> >   * Based on our initial analysis and a few test runs, we saw that the
> >     Curator acquire() method acquires the lock based on an "about to be
> >     deleted" lock node from the previous session. Explanation: the ephemeral
> >     node created by the previous session is still seen by the client that
> >     reconnected with a new session id, until the server cleans it up. If this
> >     happens, Curator acquire() will hold the lock.
> >
> >   * Clearly we could see the race condition (in ZooKeeper code) between
> >     1) the client reconnecting to the server with a new session id and
> >     2) the server deleting the ephemeral nodes of the client's previous
> >     session. We were able to reproduce this issue using the following approach:
> >       * Artificially break the socket connection between client and server for 30s.
> >       * Artificially pause the server-side code for a minute and then unpause it.
> >
> >   * Given this race condition, if the client manages to reconnect to the
> >     server with a new session id before the server cleans up the ephemeral
> >     nodes of the client's previous session, a Curator acquire() that is
> >     waiting for the lock will take the lock, as it still sees the lock node
> >     in the ZooKeeper directory. Eventually the server cleans up the ephemeral
> >     nodes, leaving the Curator local lock thread data stale and giving the
> >     illusion that the client still holds the lock while its ephemeral node is
> >     gone.
> >
> >   * Timeline of events, for better understanding:
> >       * At t1, Client A and Client B establish ZooKeeper sessions with
> >         session ids A1 and B1 respectively.
> >       * At t2, Client A creates the lock node N1 & acquires the lock.
> >       * At t3, Client B creates the lock node N2 & blocks in acquire()
> >         waiting for the lock.
> >       * At t4, the sessions time out for both clients & the server is about
> >         to clean up the old sessions; Client A tries to release the lock.
> >       * At t5, Client A and Client B reconnect to the server with new session
> >         ids A2 and B2 respectively, before the server deletes the ephemeral
> >         nodes N1 & N2 of the previous client sessions. Client A releases the
> >         lock, deleting N1, and tries to acquire it again by creating node N3;
> >         Client B, blocked in acquire(), acquires the lock based on N2 (the
> >         about-to-be-deleted node created by its previous session).
> >       * At t6, the server cleans up the ephemeral node N2 created by Client
> >         B's previous session. Client A acquires the lock with node N3 once
> >         its preceding sequence node N2 is deleted, whereas Client B, who
> >         incorrectly acquired the lock at t5, still holds the lock.
> >
> > Note:
> >
> >   * We are not certain whether this race condition we noticed in the
> >     ZooKeeper code is intentional design.
> >
> > Questions:
> >
> >   * Given this race condition seen in the ZooKeeper code, we would like to
> >     hear your recommendations / suggestions for avoiding this issue while
> >     using the Curator lock code.
> >
> >   * We also see that InterProcessMutex has the makeRevocable() API that
> >     enables the application to revoke a lock, but it handles the node change
> >     event only, not the node deleted event. I understand that handling the
> >     node change event alone makes sense, as it enables the application user
> >     to revoke the lock externally from the application side. But would it
> >     also be okay to have one for the node delete event, so that the
> >     application can register a listener for node delete events? I would like
> >     to hear your thoughts.
> >
> > Many Thanks,
> > Viswa
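For context on that last question, here is a minimal sketch of how the existing makeRevocable() mechanism is typically wired up: the RevocationListener fires when another process calls Revoker.attemptRevoke(), which works by setting data on the lock node (a node change event), and it does not fire when the ephemeral node itself is deleted, which is exactly the gap Viswa is pointing at. The lock path and class/method names are placeholders of mine, not anything from Curator.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.framework.recipes.locks.Revoker;

public class RevocableLockSketch
{
    // Placeholder lock path, not anything Curator defines
    private static final String LOCK_PATH = "/locks/example";

    // Lock holder: gives the lock up cooperatively when a revocation is requested
    public void holder(CuratorFramework client) throws Exception
    {
        InterProcessMutex mutex = new InterProcessMutex(client, LOCK_PATH);
        AtomicBoolean revocationRequested = new AtomicBoolean(false);

        // Called when another process asks us to release; triggered by a data
        // change on the lock node, not by the node being deleted
        mutex.makeRevocable(forLock -> revocationRequested.set(true));

        if ( mutex.acquire(30, TimeUnit.SECONDS) )
        {
            try
            {
                while ( !revocationRequested.get() )
                {
                    // ... do one unit of protected work, then re-check the flag ...
                    TimeUnit.SECONDS.sleep(1);
                }
            }
            finally
            {
                mutex.release();    // released by the owning thread
            }
        }
    }

    // Another process: asks the current holder of the given lock node to release it
    public void requester(CuratorFramework client, String lockNodePath) throws Exception
    {
        Revoker.attemptRevoke(client, lockNodePath);
    }
}

Whether Curator should additionally surface a node-deleted callback, so the holder in a sketch like this could also learn that its ephemeral node vanished after a session expiry, is essentially the question Viswa is raising.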