Correct. So you can't actually implement correct retry logic for some error conditions.
On Fri, Oct 14, 2011 at 9:44 AM, Jordan Zimmerman <jzimmer...@netflix.com> wrote:

> True. But, it wouldn't be possible to get KeeperException.Code.NODEEXISTS
> for sequential files, right?
>
> -JZ
>
> On 10/14/11 9:41 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>
> > Yes. That works fine with idempotent operations like creating a
> > non-sequential file.
> >
> > Of course, it doesn't work with sequential files since you don't know
> > who created any other znodes out there.
> >
> > On Fri, Oct 14, 2011 at 9:39 AM, Jordan Zimmerman
> > <jzimmer...@netflix.com> wrote:
> >
> > > FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry
> > > loop and just ignores it, treating it as a success. I'm not sure if
> > > other libraries do that. So, this is a case where a disconnection can
> > > be handled generically.
> > >
> > > -JZ
> > >
> > > On 10/14/11 7:20 AM, "Fournier, Camille F." <camille.fourn...@gs.com>
> > > wrote:
> > >
> > > > Pretty much all of the Java client wrappers out there in the wild
> > > > have some sort of a retry loop around operations, to make some of
> > > > this easier to deal with. But they don't, to my knowledge, deal with
> > > > the situation of knowing whether an operation succeeded in the case
> > > > of a disconnect (it is possible to push out a request and get a
> > > > disconnect back before you get a response for that request, so you
> > > > don't know if your request succeeded or failed). So you may end up,
> > > > for example, writing something twice in the case of writing a
> > > > SEQUENTIAL-type node. For many use cases of sequential, this isn't a
> > > > big deal.
> > > >
> > > > I don't know of anything that handles this in a more subtle way than
> > > > simply retrying. As Ted has mentioned in earlier emails on the
> > > > subject: "You can't just assume that you can retry an operation on
> > > > ZooKeeper and get the right result. The correct handling is
> > > > considerably more subtle.
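[A known mitigation for the sequential-create ambiguity discussed above (it comes from the ZooKeeper lock recipe documentation, not from this thread) is to embed a client-generated GUID in the node name, so that after a connection loss the client can list the children and recognize whether its own create actually landed. The sketch below shows only that naming logic; the lock path and `lock-` prefix are illustrative, and a real client would pass the prefix to `create()` with the sequential flag and call `getChildren()` after reconnecting.]

```java
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Sketch of the GUID-in-the-node-name technique for sequential creates.
public class GuidNodeNames {

    // Prefix passed to create(..., EPHEMERAL_SEQUENTIAL); the server
    // appends a 10-digit sequence, e.g. "lock-<guid>-0000000003".
    public static String nodePrefix(String guid) {
        return "lock-" + guid + "-";
    }

    // After a connection loss during create(), list the children and look
    // for a node carrying our GUID: if one exists, the earlier create
    // succeeded and must not be retried.
    public static Optional<String> findOwnNode(List<String> children,
                                               String guid) {
        String marker = "-" + guid + "-";
        return children.stream().filter(c -> c.contains(marker)).findFirst();
    }

    public static String newGuid() {
        return UUID.randomUUID().toString();
    }
}
```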
> > > > Hiding that is not a good thing unless you say right up front that
> > > > you are compromising either expressivity (as does Kept Collections)
> > > > or correctness (as does zkClient)."
> > > >
> > > > It's not clear to me that it is possible to write a generic client
> > > > to "correctly" handle retries on disconnect, because what correct
> > > > means varies from use case to use case. One of the challenges, I
> > > > think, for getting comfortable with using ZK is knowing the
> > > > correctness bounds for your particular use case and understanding
> > > > the failure scenarios wrt that use case and ZK.
> > > >
> > > > C
> > > >
> > > > -----Original Message-----
> > > > From: Mike Schilli [mailto:m...@perlmeister.com]
> > > > Sent: Thursday, October 13, 2011 9:27 PM
> > > > To: user@zookeeper.apache.org
> > > > Subject: Re: Locks based on ephemeral nodes - Handling network
> > > > outage correctly
> > > >
> > > > On Wed, 12 Oct 2011, Ted Dunning wrote:
> > > >
> > > > > ZK will tell you when the connection is lost (but not yet
> > > > > expired). When this happens, the application needs to pay
> > > > > attention and pause before continuing to assume it still has the
> > > > > lock.
> > > >
> > > > I think this applies to every write operation in ZooKeeper, which I
> > > > find is a challenge to deal with.
> > > >
> > > > So basically, every time an application writes something to
> > > > ZooKeeper, it needs to check the result, but what to do if it
> > > > fails? Check if it's an error indicating the connection was lost,
> > > > and try a couple of times to reinstate the connection and replay
> > > > the write? At least, that's what the documentation of the Perl
> > > > wrapper in Net::ZooKeeper suggests.
> > > >
> > > > Are there best practices around this, or, better yet, a client API
> > > > that actually implements this, so the application doesn't have to
> > > > implement a ZooKeeper wrapper? Something like "retry 3 times with
> > > > 10 second waits in between and fail otherwise".
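[The "retry N times with a fixed wait" policy Mike asks about, combined with the NODEEXISTS-as-success behavior Jordan describes for Curator, can be sketched as below. The exception classes and method names here are stand-ins, not the real `org.apache.zookeeper.KeeperException` hierarchy; an actual wrapper would catch `KeeperException.ConnectionLossException` and `KeeperException.NodeExistsException` instead. Note the caveat the thread keeps returning to: this is only safe for idempotent operations, not sequential creates.]

```java
import java.util.concurrent.Callable;

// Sketch of a bounded retry loop for idempotent ZooKeeper operations.
public class RetryLoop {

    public static class ConnectionLoss extends Exception {}
    public static class NodeExists extends Exception {}

    // Retry an operation that is safe to repeat. A NodeExists on retry is
    // treated as success (returning null), mirroring what the thread says
    // Curator does for non-sequential creates.
    public static <T> T retryIdempotent(Callable<T> op, int maxRetries,
                                        long waitMillis) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (NodeExists e) {
                return null;       // an earlier attempt already succeeded
            } catch (ConnectionLoss e) {
                last = e;          // ambiguous: the write may have landed
                Thread.sleep(waitMillis);
            }
        }
        throw last;                // give up after maxRetries
    }
}
```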
> > > > --
> > > > -- Mike
> > > >
> > > > Mike Schilli
> > > > m...@perlmeister.com
> > > >
> > > > > 2011/10/12 Frédéric Jolliton <frede...@jolliton.com>
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > There is something that bothers me about ephemeral nodes.
> > > > > >
> > > > > > I need to create some locks using ZooKeeper. I followed the
> > > > > > "official" recipe, except that I don't use the EPHEMERAL flag.
> > > > > > The reason for that is that I don't know how I should proceed
> > > > > > if the connection to the ZooKeeper ensemble is ever lost. But
> > > > > > otherwise, everything works nicely.
> > > > > >
> > > > > > The EPHEMERAL flag is useful if the owner of the lock
> > > > > > disappears (exiting abnormally). From the point of view of the
> > > > > > ZooKeeper ensemble, the connection times out (or is closed
> > > > > > explicitly), and the lock is released. That's great.
> > > > > >
> > > > > > However, if I lose the connection temporarily (network outage),
> > > > > > the ZooKeeper ensemble again sees the connection timing out,
> > > > > > but actually the owner of the lock is still there doing some
> > > > > > work on the locked resource. The lock is released by ZooKeeper
> > > > > > anyway.
> > > > > >
> > > > > > How should this case be handled?
> > > > > >
> > > > > > All I can see is that the owner can only verify that the lock
> > > > > > was no longer owned because releasing the lock will give a
> > > > > > Session Expired error (assuming we retry reconnecting while we
> > > > > > get a Connection Loss error), or because of an event sent at
> > > > > > some point because the connection was also closed automatically
> > > > > > on the client side by the client library (not sure about this
> > > > > > last point). Knowing that the connection expired necessarily
> > > > > > means that the lock was lost, but it may be too late.
> > > > > > I mean that there is a short time lapse where the process that
> > > > > > owns the lock has not tried to release it yet, and thus doesn't
> > > > > > know it lost it, and another process was able to acquire it in
> > > > > > the meantime. This is a big problem.
> > > > > >
> > > > > > That's why I avoid the EPHEMERAL flag for now, and plan to rely
> > > > > > on a periodic cleaning task to drop locks no longer owned by
> > > > > > some process (a task which is not trivial either).
> > > > > >
> > > > > > I would appreciate any tips for handling such a situation in a
> > > > > > better way. What is your experience in such cases?
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > --
> > > > > > Frédéric Jolliton
> > > > > > Outscale SAS
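[Ted's advice earlier in the thread ("pause before continuing to assume it still has the lock") can be sketched as a small state tracker that a lock holder consults before each operation on the protected resource. The state names mirror ZooKeeper's `Watcher.Event.KeeperState` values (Disconnected, SyncConnected, Expired), but this class itself is hypothetical; in real code the three callbacks would be driven by the connection watcher. It does not close the window Frédéric describes, where another process acquires the lock before the old holder learns its session expired, but it makes the holder stop work as soon as the outcome becomes uncertain.]

```java
// Sketch: a guard a lock holder checks before touching the resource.
public class LockGuard {

    public enum State { CONNECTED, SUSPENDED, EXPIRED }

    private State state = State.CONNECTED;

    public void onDisconnected() {      // session may still be alive
        if (state == State.CONNECTED) state = State.SUSPENDED;
    }

    public void onReconnected() {       // session survived the outage
        if (state == State.SUSPENDED) state = State.CONNECTED;
    }

    public void onSessionExpired() {    // ephemeral lock node is gone
        state = State.EXPIRED;
    }

    // While SUSPENDED the holder must pause; once EXPIRED it must assume
    // another process may already hold the lock.
    public boolean maySafelyProceed() {
        return state == State.CONNECTED;
    }

    public State state() { return state; }
}
```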