(crossposting to dev@zookeeper)

Hi ZooKeepers,
can anyone take a look at this problem a user found while using Curator?

Thanks in advance
Enrico

Il giorno mar 20 lug 2021 alle ore 09:01 Cameron McKenzie <
cammcken...@apache.org> ha scritto:

> hey Viswa,
> I'm by no means an expert on this chunk of code, but I've done a bit of
> digging and it certainly seems that you've uncovered an issue.
>
> Ultimately, the root cause of the issue is the weirdness in the way ZK
> handles ephemeral nodes. I'm not sure if this is intentional or a bug,
> but I would have thought that if ephemeral nodes are tied to a session
> then they should be removed as soon as that session has expired.
>
> From the Curator standpoint, it appears that the InterProcessMutex has
> been written with the assumption that ephemeral nodes are deleted as
> soon as their session expires. To fix it on the Curator side, I think we
> would need to provide a way to interrupt the acquire() method, so that
> when the connection goes into a SUSPENDED state we can force the
> acquisition to restart. I guess you could just explicitly interrupt the
> thread when your ConnectionStateListener gets a SUSPENDED event, but
> this is a bit ugly.
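>
> For illustration, a minimal sketch of that interrupt-based workaround
> (the thread bookkeeping here is my own illustration, not existing
> Curator API):
>
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.curator.framework.recipes.locks.InterProcessMutex;
>     import org.apache.curator.framework.state.ConnectionState;
>     import org.apache.curator.framework.state.ConnectionStateListener;
>
>     public class InterruptibleAcquire {
>         private volatile Thread acquiringThread;
>
>         // Sketch: interrupt the thread blocked in acquire() when the
>         // connection goes SUSPENDED, so it stops waiting on a lock node
>         // whose session may already be dead on the server.
>         public void acquire(CuratorFramework client, String lockPath) throws Exception {
>             InterProcessMutex mutex = new InterProcessMutex(client, lockPath);
>             ConnectionStateListener listener = (c, newState) -> {
>                 Thread t = acquiringThread;
>                 if (newState == ConnectionState.SUSPENDED && t != null) {
>                     t.interrupt(); // acquire() throws; the caller can retry
>                 }
>             };
>             client.getConnectionStateListenable().addListener(listener);
>             try {
>                 acquiringThread = Thread.currentThread();
>                 mutex.acquire();
>             } finally {
>                 acquiringThread = null;
>                 client.getConnectionStateListenable().removeListener(listener);
>             }
>         }
>     }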
>
> Might be worth raising the issue on the ZK lists to see if this is a bug or
> by design.
>
> Any other devs have any thoughts?
> cheers
>
>
> On Tue, Jul 20, 2021 at 3:45 AM Viswanathan Rajagopal
> <viswanathan.rajag...@workday.com.invalid> wrote:
>
> > Hi Team,
> >
> > Good day.
> >
> > Recently we came across a “double locking” issue (i.e. two clients
> > acquiring the same lock) using the Curator InterProcessMutex lock APIs
> > in our application.
> >
> > Our use case:
> >
> >   *   Two clients attempt to acquire the zookeeper lock using Curator
> > InterProcessMutex, and whichever client owns it releases it once it
> > sees the connection disconnect (on receiving Connection.SUSPENDED /
> > Connection.LOST Curator connection events from its connection
> > listener), as sketched below
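> >
> > For reference, a minimal sketch of that pattern (the connection string
> > and lock path are placeholders):
> >
> >     import org.apache.curator.framework.CuratorFramework;
> >     import org.apache.curator.framework.CuratorFrameworkFactory;
> >     import org.apache.curator.framework.recipes.locks.InterProcessMutex;
> >     import org.apache.curator.framework.state.ConnectionState;
> >     import org.apache.curator.retry.ExponentialBackoffRetry;
> >
> >     public class LockAndReleaseOnDisconnect {
> >         public static void main(String[] args) throws Exception {
> >             CuratorFramework client = CuratorFrameworkFactory.newClient(
> >                     "localhost:2181", new ExponentialBackoffRetry(1000, 3));
> >             client.start();
> >             InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-lock");
> >             // Release the lock as soon as the connection looks unhealthy
> >             client.getConnectionStateListenable().addListener((c, newState) -> {
> >                 if ((newState == ConnectionState.SUSPENDED
> >                         || newState == ConnectionState.LOST)
> >                         && lock.isAcquiredInThisProcess()) {
> >                     try {
> >                         lock.release();
> >                     } catch (Exception e) {
> >                         // best effort; the session is in trouble anyway
> >                     }
> >                 }
> >             });
> >             lock.acquire(); // blocks until the lock is owned
> >             // ... do work while holding the lock ...
> >         }
> >     }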
> >
> > Issue we noticed:
> >
> >   *   After the session expired & the clients reconnected with new
> > sessions, both clients appeared to have acquired the lock. The
> > interesting thing we found is that one of the clients still held the
> > lock while its (ephemeral) lock node was gone
> >
> > Things we found:
> >
> >   *   Based on our initial analysis and a few test runs, we saw that
> > the Curator acquire() method acquires the lock based on an “about to
> > be deleted” lock node of the previous session. Explanation: an
> > ephemeral node created by the previous session is still seen by the
> > client that reconnected with a new session id, until the server cleans
> > it up. If this happens, Curator acquire() will hold the lock.
> >
> >
> >
> >   *   Clearly we could see the race condition (in zookeeper code)
> > between (1) the client reconnecting to the server with a new session
> > id and (2) the server deleting the ephemeral nodes of the client's
> > previous session. We were able to reproduce this issue using the
> > following approach:
> >      *   Artificially break the socket connection between client and
> > server for 30s
> >      *   Artificially pause the server processes for a minute and then
> > unpause them
> >
> >
> >   *   Given the above race condition, if a client manages to reconnect
> > to the server with a new session id before the server cleans up the
> > ephemeral nodes of the client's previous session, a Curator acquire()
> > that is waiting for the lock will take it, as it still sees the old
> > lock node in the zookeeper directory. Eventually the server cleans up
> > the ephemeral nodes, leaving the Curator local lock thread data stale
> > and giving the illusion that the client still holds the lock while its
> > ephemeral node is gone (a sketch of a check for this follows after the
> > timeline below)
> >
> >
> >   *   Timeline of events, for better understanding:
> >      *   At t1, Client A and Client B establish zookeeper sessions
> > with session ids A1 and B1 respectively
> >      *   At t2, Client A creates the lock node N1 & acquires the lock
> >      *   At t3, Client B creates the lock node N2 & blocks in
> > acquire() waiting for the lock
> >      *   At t4, the session times out for both clients & the server is
> > about to clean up the old sessions; Client A starts trying to release
> > the lock
> >      *   At t5, Client A and Client B reconnect to the server with new
> > session ids A2 and B2 respectively, before the server deletes the
> > ephemeral nodes N1 & N2 of the previous client sessions. Client A
> > releases the lock, deleting N1, and tries to acquire it again by
> > creating node N3; Client B, who is blocked in acquire(), acquires the
> > lock based on N2 (the about-to-be-deleted node created by its previous
> > session)
> >      *   At t6, the server cleans up the ephemeral node N2 created by
> > Client B's previous session. Client A acquires the lock with node N3
> > as the preceding sequence node N2 gets deleted, whereas Client B, who
> > incorrectly acquired the lock at t5, still holds the lock
> >
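> > For illustration, one way to detect the stale lock would be to check,
> > after acquire() returns, that the lock node's ephemeral owner is the
> > current session (a sketch; lockNodePath is assumed to be known, since
> > InterProcessMutex does not expose your own node path directly, though
> > getParticipantNodes() lists the candidates):
> >
> >     import org.apache.curator.framework.CuratorFramework;
> >     import org.apache.zookeeper.data.Stat;
> >
> >     public class LockOwnershipCheck {
> >         // Sketch: verify that the lock node we believe we hold was
> >         // created by our CURRENT session, not by a dying previous one.
> >         public static boolean ownedByCurrentSession(CuratorFramework client,
> >                                                     String lockNodePath) throws Exception {
> >             Stat stat = client.checkExists().forPath(lockNodePath);
> >             if (stat == null) {
> >                 return false; // the server already cleaned the node up
> >             }
> >             long sessionId = client.getZookeeperClient().getZooKeeper().getSessionId();
> >             // ephemeralOwner is the id of the session that created the node
> >             return stat.getEphemeralOwner() == sessionId;
> >         }
> >     }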
> >
> > Note:
> >
> >   *   We are not certain whether this race condition that we noticed
> > in the zookeeper code is intentional by design.
> >
> > Questions:
> >
> >   *   Given this race condition in the zookeeper code, we would like
> > to hear your recommendations / suggestions for avoiding this issue
> > while using the Curator lock code
> >
> >
> >
> >   *   We also see that InterProcessMutex has the makeRevocable() API
> > that enables the application to revoke the lock, but it handles the
> > node-changed event only, not the node-deleted event. I understand that
> > handling the node-changed event alone makes sense, as it lets the
> > application user revoke the lock externally. But would it also be okay
> > to have one for the node-deleted event, so that the application can
> > register a listener for it? I would like to hear your thoughts. A
> > sketch of what we mean follows below.
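> >
> > For illustration, the kind of thing we mean, done application-side
> > with the existing one-shot ZooKeeper watch (a sketch; onLockLost is a
> > hypothetical application callback):
> >
> >     import org.apache.curator.framework.CuratorFramework;
> >     import org.apache.zookeeper.Watcher;
> >
> >     public class LockNodeWatch {
> >         // Sketch: let the application watch its own lock node and
> >         // react when the node disappears. A NodeDeleted event is
> >         // terminal for the node, so the one-shot watch does not need
> >         // re-registration.
> >         public static void watchLockNode(CuratorFramework client,
> >                                          String lockNodePath,
> >                                          Runnable onLockLost) throws Exception {
> >             Watcher watcher = event -> {
> >                 if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
> >                     onLockLost.run();
> >                 }
> >             };
> >             if (client.checkExists().usingWatcher(watcher).forPath(lockNodePath) == null) {
> >                 onLockLost.run(); // the node is already gone
> >             }
> >         }
> >     }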
> >
> >
> > Many Thanks,
> > Viswa
> >
>
