Yes. That works fine with idempotent operations like creating a non-sequential file.
Of course, it doesn't work with sequential files, since you don't know who created any other znodes out there.

On Fri, Oct 14, 2011 at 9:39 AM, Jordan Zimmerman <jzimmer...@netflix.com> wrote:

> FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry loop
> and just ignores it, treating it as a success. I'm not sure if other
> libraries do that. So, this is a case where a disconnection can be handled
> generically.
>
> -JZ
>
> On 10/14/11 7:20 AM, "Fournier, Camille F." <camille.fourn...@gs.com>
> wrote:
>
> >Pretty much all of the Java client wrappers out there in the wild have
> >some sort of a retry loop around operations, to make some of this easier
> >to deal with. But they don't, to my knowledge, deal with the situation of
> >knowing whether an operation succeeded in the case of a disconnect (it is
> >possible to push out a request and get a disconnect back before you get
> >a response for that request, so you don't know if your request succeeded
> >or failed). So you may end up, for example, writing something twice in
> >the case of writing a SEQUENTIAL-type node. For many use cases of
> >sequential, this isn't a big deal.
> >
> >I don't know of anything that handles this in a more subtle way than
> >simply retrying. As Ted has mentioned in earlier emails on the subject,
> >"You can't just assume that you can retry an operation on ZooKeeper and
> >get the right result. The correct handling is considerably more subtle.
> >Hiding that is not a good thing unless you say right up front that you
> >are compromising either expressivity (as does Kept Collections) or
> >correctness (as does zkClient)."
> >
> >It's not clear to me that it is possible to write a generic client to
> >"correctly" handle retries on disconnect, because what correct means
> >varies from use case to use case.
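The duplicate-sequential-write problem described above has a known client-side mitigation: embed a unique ID in the node name, and before retrying a create after a connection loss, list the children to see whether the earlier request actually landed. (Curator implements a variant of this as "protected" creates.) A minimal sketch, using a stub client in place of a real ZooKeeper handle, with illustrative names of my own choosing:

```python
import uuid

class StubClient:
    """Stub for a ZooKeeper handle: remembers created sequential nodes."""
    def __init__(self):
        self.children = []
        self.counter = 0

    def create_sequential(self, prefix):
        # Server appends a monotonically increasing suffix, like ZK does.
        self.counter += 1
        name = "%s%010d" % (prefix, self.counter)
        self.children.append(name)
        return name

    def get_children(self):
        return list(self.children)

def safe_sequential_create(client, session_id, base="lock-"):
    """Create a sequential node at most once, even across retries."""
    prefix = "%s%s-" % (base, session_id)
    # Before (re)trying, check whether a previous attempt already succeeded
    # but the reply was lost to a disconnect.
    for child in client.get_children():
        if child.startswith(prefix):
            return child
    return client.create_sequential(prefix)

client = StubClient()
sid = uuid.uuid4().hex
first = safe_sequential_create(client, sid)
# A retry after connection loss finds the existing node: no duplicate.
second = safe_sequential_create(client, sid)
assert first == second
assert len(client.get_children()) == 1
```

With a real client the same logic would wrap `create(..., sequence=True)` and `get_children()`; the point is only that the unique prefix makes the otherwise ambiguous retry answerable.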
> >One of the challenges, I think, for getting comfortable with using ZK is
> >knowing the correctness bounds for your particular use case and
> >understanding the failure scenarios wrt that use case and ZK.
> >
> >C
> >
> >
> >-----Original Message-----
> >From: Mike Schilli [mailto:m...@perlmeister.com]
> >Sent: Thursday, October 13, 2011 9:27 PM
> >To: user@zookeeper.apache.org
> >Subject: Re: Locks based on ephemeral nodes - Handling network outage
> >correctly
> >
> >On Wed, 12 Oct 2011, Ted Dunning wrote:
> >
> >> ZK will tell you when the connection is lost (but not yet expired).
> >> When this happens, the application needs to pay attention and pause
> >> before continuing to assume it still has the lock.
> >
> >I think this applies to every write operation in ZooKeeper, which I find
> >is a challenge to deal with.
> >
> >So basically, every time an application writes something to ZooKeeper,
> >it needs to check the result, but what to do if it fails? Check if it's
> >an error indicating the connection was lost, and try a couple of times
> >to reinstate the connection and replay the write? At least, that's what
> >the documentation of the Perl wrapper in Net::ZooKeeper suggests.
> >
> >Are there best practices around this, or, better yet, a client API that
> >actually implements this, so the application doesn't have to implement
> >a ZooKeeper wrapper? Something like "retry 3 times with 10-second waits
> >in between and fail otherwise".
> >
> >--
> >-- Mike
> >
> >Mike Schilli
> >m...@perlmeister.com
> >
> >
> >> 2011/10/12 Frédéric Jolliton <frede...@jolliton.com>
> >>
> >>> Hello all,
> >>>
> >>> There is something that bothers me about ephemeral nodes.
> >>>
> >>> I need to create some locks using Zookeeper. I followed the "official"
> >>> recipe, except that I don't use the EPHEMERAL flag. The reason for
> >>> that is that I don't know how I should proceed if the connection to
> >>> the Zookeeper ensemble is ever lost.
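The policy Mike asks for ("retry 3 times with 10-second waits, then fail") is straightforward to sketch as a wrapper; the names below are illustrative, not from any real library. It also treats NodeExists as success the way Jordan describes Curator doing, which, per the discussion above, is only correct for idempotent operations:

```python
import time

class ConnectionLoss(Exception):
    pass

class NodeExists(Exception):
    pass

def with_retries(op, attempts=3, wait_seconds=10):
    """Run op(), retrying on connection loss; fail after `attempts` tries.
    NodeExists is treated as success - safe only for idempotent creates."""
    last = None
    for i in range(attempts):
        try:
            return op()
        except NodeExists:
            return None  # node is there; treat as success, as Curator does
        except ConnectionLoss as e:
            last = e
            if i < attempts - 1:
                time.sleep(wait_seconds)
    raise last

# Demo with a fake operation that loses the connection twice, then works.
state = {"calls": 0}
def flaky_create():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionLoss()
    return "/app/node"

assert with_retries(flaky_create, attempts=3, wait_seconds=0) == "/app/node"
assert state["calls"] == 3
```

This hides the mechanics but, as Camille and Ted note, not the semantics: whether a blind replay is *correct* still depends on the operation being retried.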
> >>> But otherwise, everything works nicely.
> >>>
> >>> The EPHEMERAL flag is useful if the owner of the lock disappears
> >>> (exiting abnormally). From the point of view of the Zookeeper
> >>> ensemble, the connection times out (or is closed explicitly) and the
> >>> lock is released. That's great.
> >>>
> >>> However, if I lose the connection temporarily (network outage), the
> >>> Zookeeper ensemble again sees the connection timing out... but
> >>> actually the owner of the lock is still there, doing some work on the
> >>> locked resource. But the lock is released by Zookeeper anyway.
> >>>
> >>> How should this case be handled?
> >>>
> >>> All I can see is that the owner can only verify that the lock was no
> >>> longer owned because releasing the lock will give a Session Expired
> >>> error (assuming we retry reconnecting while we get a Connection Loss
> >>> error), or because of an event sent at some point because the
> >>> connection was also closed automatically on the client side by
> >>> libkeeper (not sure about this last point). Knowing that the
> >>> connection expired necessarily means that the lock was lost, but it
> >>> may be too late.
> >>>
> >>> I mean that there is a short time lapse where the process that owns
> >>> the lock has not tried to release it yet, and thus doesn't know it
> >>> lost it, while another process was able to acquire it in the
> >>> meantime. This is a big problem.
> >>>
> >>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
> >>> periodic cleaning task to drop locks no longer owned by some process
> >>> (a task which is not trivial either).
> >>>
> >>> I would appreciate any tips on handling such a situation in a better
> >>> way. What is your experience in such cases?
> >>>
> >>> Regards,
> >>>
> >>> --
> >>> Frédéric Jolliton
> >>> Outscale SAS
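Ted's advice at the top of this sub-thread — pause on disconnect, treat expiry as the lock definitively lost — amounts to a small state machine driven by the session events ZooKeeper already delivers (Disconnected, SyncConnected, Expired). A sketch of that machine; the `LockHolder` class is my own illustration, not an API from any client library:

```python
CONNECTED, SUSPENDED, LOST = "connected", "suspended", "lost"

class LockHolder:
    """Tracks whether an ephemeral-node lock may still be trusted."""
    def __init__(self):
        self.state = CONNECTED

    def on_session_event(self, event):
        if event == "Disconnected":
            # We might still hold the lock server-side, but must assume not.
            self.state = SUSPENDED
        elif event == "SyncConnected":
            if self.state == SUSPENDED:
                # Session survived the outage: the ephemeral node, and
                # therefore the lock, was never released.
                self.state = CONNECTED
        elif event == "Expired":
            # Session gone => ephemeral node gone => lock lost for good.
            self.state = LOST

    def may_touch_resource(self):
        # Only act on the protected resource while provably connected.
        return self.state == CONNECTED

holder = LockHolder()
assert holder.may_touch_resource()
holder.on_session_event("Disconnected")
assert not holder.may_touch_resource()   # pause, as Ted advises
holder.on_session_event("SyncConnected")
assert holder.may_touch_resource()       # session survived; lock still held
holder.on_session_event("Expired")
assert not holder.may_touch_resource()   # must re-acquire from scratch
```

Pausing on Disconnected narrows, but cannot fully close, the window Frédéric describes: the client can only learn of expiry after the fact, so work already in flight when the session died may still overlap with a new lock holder.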