Thanks Jordan,

> If client1's heartbeat fails its main watcher will get a Disconnect event

Suppose the network link between client1 and the server is of very low
quality (a high packet loss rate?) but still fully functional.

Client1 may be happily sending heartbeat messages to the server without
noticing anything wrong; but the ZK server could be unable to receive
heartbeats from client1 for a long period of time, which leads the ZK
server to time out client1's session and delete the ephemeral node.

Thus, the client's session could be timed out by the ZK server without
ever triggering a Disconnect event on the client.
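
To make my scenario concrete, here is a rough sketch (plain ZooKeeper
Java API; the connect string, session timeout and class name are made
up for illustration) of the kind of watcher I believe you are
describing. My point is that in the degraded-link case the
Disconnected/Expired callback may only arrive after the server has
already deleted the ephemeral node:

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LockSessionWatcher implements Watcher {

    // Cleared as soon as we can no longer be sure we still hold the lock.
    private volatile boolean probablyHoldsLock = true;

    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case Disconnected:   // client finally notices the flaky link
            case Expired:        // session already expired on the server side
                probablyHoldsLock = false;
                break;
            default:
                break;
        }
    }

    public boolean probablyHoldsLock() {
        return probablyHoldsLock;
    }

    public static void main(String[] args) throws IOException {
        LockSessionWatcher watcher = new LockSessionWatcher();
        // Hypothetical connect string and session timeout.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, watcher);
        // ... lock recipe steps 1-5 would go here ...
        // Problem: the server may delete the ephemeral node *before* this
        // client ever receives Disconnected or Expired.
    }
}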

>Well behaving ZK applications must watch for this and assume that it no longer 
>holds the lock and, thus, should delete its node. If client1 needs the lock 
>again it should try to re-acquire it from step 1 of the recipe. Further, well 
>behaving ZK applications must re-try node deletes if there is a connection 
>problem. Have a look at Curator's implementation for details.

Thanks for pointing me to Curator's implementation; I will dig into the
source code.
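
For reference while I read it, my rough understanding of the usage is
something like the sketch below (connect string and lock path are made
up; package names are the Apache ones and may differ from the
Netflix-era release, so please treat this as an illustration rather
than as Curator's exact API):

import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Curator tracks connection state; on SUSPENDED/LOST the application
        // must assume it no longer holds any lock.
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                if (newState == ConnectionState.SUSPENDED
                        || newState == ConnectionState.LOST) {
                    // stop the protected work here
                }
            }
        });

        InterProcessMutex mutex = new InterProcessMutex(client, "/locknode/my-lock");
        if (mutex.acquire(10, TimeUnit.SECONDS)) {
            try {
                // protected work
            } finally {
                mutex.release();  // per your note, the delete is retried on connection problems
            }
        }
        client.close();
    }
}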

But I still feel that, no matter how well a ZK application behaves, if
we use ephemeral nodes in the lock recipe, we cannot guarantee "at any
snapshot in time no two clients think they hold the same lock", which
is the fundamental requirement/constraint for a lock.
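
Even the most careful client cannot close the gap. Suppose the holder
re-validates its own znode immediately before the critical section,
something like this hypothetical helper (the names are mine, just for
illustration):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class Recheck {
    // Hypothetical helper: re-validate our own lock node right before the
    // critical section.
    static void runIfStillHoldingLock(ZooKeeper zk, String myLockNode, Runnable work)
            throws KeeperException, InterruptedException {
        if (zk.exists(myLockNode, false) != null) {
            // The session can still expire right here; the server deletes the
            // ephemeral node, client2 acquires the lock, and we run anyway.
            work.run();
        }
    }
}

The window between exists( ) returning and work.run( ) starting is
exactly the "snapshot in time" where two clients can both think they
hold the lock.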

Mr. Andrey Stepachev suggested that I should use a timer on the client
side to track the session_timeout. That sounds reasonable, but I think
it implicitly imposes some constraints on clock drift, which I did not
expect in a solution based on ZooKeeper (ZK is supposed to keep the
animals well).
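
If I understand the suggestion correctly, it amounts to something like
the sketch below (my own guess at the idea; the class, the touch( )
hook and the safety margin are hypothetical):

import java.util.concurrent.TimeUnit;

public class LocalSessionLease {

    private final long sessionTimeoutNanos;
    private volatile long lastServerContactNanos;

    public LocalSessionLease(long sessionTimeoutMs) {
        this.sessionTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(sessionTimeoutMs);
        this.lastServerContactNanos = System.nanoTime();
    }

    // Called whenever a request or ping to the ZK server succeeds.
    public void touch() {
        lastServerContactNanos = System.nanoTime();
    }

    // Give up the lock locally once we have not heard from the server for
    // (session timeout - safety margin), because by then the server may have
    // already expired the session and deleted the ephemeral node.
    public boolean mayStillHoldLock(long safetyMarginMs) {
        long elapsed = System.nanoTime() - lastServerContactNanos;
        return elapsed < sessionTimeoutNanos - TimeUnit.MILLISECONDS.toNanos(safetyMarginMs);
    }
}

But the client's measurement of "session timeout elapsed" only agrees
with the server's if the two clocks advance at roughly the same rate,
which is the clock-drift constraint I was referring to.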


On Sat, Jan 12, 2013 at 4:20 AM, Jordan Zimmerman
<jor...@jordanzimmerman.com> wrote:
>
> If client1's heartbeat fails its main watcher will get a Disconnect event. 
> Well behaving ZK applications must watch for this and assume that it no 
> longer holds the lock and, thus, should delete its node. If client1 needs the 
> lock again it should try to re-acquire it from step 1 of the recipe. Further, 
> well behaving ZK applications must re-try node deletes if there is a 
> connection problem. Have a look at Curator's implementation for details.
>
> -JZ
>
> On Jan 11, 2013, at 5:46 AM, Zhao Boran <hulunb...@gmail.com> wrote:
>
> > While reading the zookeeper's recipe for
> > lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks>,
> > I get confused:
> >
> > It seems that this recipe-for-distributed-lock cannot guarantee *"at any
> > snapshot in time no two clients think they hold the same lock"*.
> >
> > But since ZooKeeper is so widely adopted, if there were such a mistake in
> > the reference doc, someone would have pointed it out a long time ago.
> >
> > So, what did I misunderstand? Please help me!
> >
> > Recipe-for-distributed-lock (from
> > http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks)
> >
> > Locks
> >
> > Fully distributed locks that are globally synchronous, *meaning at any
> > snapshot in time no two clients think they hold the same lock*. These can
> > be implemented using ZooKeeper. As with priority queues, first define a
> > lock node.
> >
> >   1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the
> >   sequence and ephemeral flags set.
> >   2. Call getChildren( ) on the lock node without setting the watch flag
> >   (this is important to avoid the herd effect).
> >   3. If the pathname created in step 1 has the lowest sequence number
> >   suffix, the client has the lock and the client exits the protocol.
> >   4. The client calls exists( ) with the watch flag set on the path in the
> >   lock directory with the next lowest sequence number.
> >   5. If exists( ) returns false, go to step 2. Otherwise, wait for a
> >   notification for the pathname from the previous step before going to
> >   step 2.
> >
> > Considering the following case:
> >
> >   -
> >
> >   Client1 successfully acquired the lock (in step 3), with zk node
> >   "locknode/guid-lock-0";
> >   -
> >
> >   Client2 created node "locknode/guid-lock-1", failed to acquire the lock,
> >   and is watching "locknode/guid-lock-0";
> >   -
> >
> >   Later, for some reason (network congestion?), client1 failed to send
> >   heartbeat messages to the zk cluster on time, but client1 is still
> >   working perfectly and assumes it still holds the lock.
> >   -
> >
> >   But, ZooKeeper may think client1's session has timed out, and then
> >   1. deletes "locknode/guid-lock-0"
> >   2. sends a notification to Client2 (or sends the notification first?)
> >   3. but cannot send a "session timeout" notification to client1 in time
> >      (due to network congestion?)
> >
> >
> >   -
> >
> >   Client2 gets the notification, goes to step 2, sees the only node
> >   "locknode/guid-lock-1", which was created by itself; thus, client2 assumes
> >   it holds the lock.
> >   -
> >
> >   But at the same time, client1 assumes it holds the lock.
> >
> > Is this a valid scenario?
> >
> > Thanks a lot!
>
