That didn’t really answer any of my questions. 

If I own a lock, I am entitled to do some work exclusively; no one else should 
be doing that work. If I get disconnected or the session times out, I have to 
stop working, because somebody else will take over the work in a short time. 
If I understood the programmer’s guide correctly, the expired event will not 
be delivered to me until I reconnect. Correct? So I have to use the 
disconnected event to initiate a graceful stop. Stopping might take some time, 
e.g. because I am in the middle of a REST call that takes up to 20s. Let’s say 
doing the call twice corrupts data in the backend service (e.g. an HTTP POST, 
which is not idempotent). So, ideally, if I am still running, I should try my 
best to complete normally. If the state of the work units is kept in ZK, I 
cannot update it anyway while disconnected. If I store it in some other 
datastore, I may or may not be able to update it (depending on how the network 
has been partitioned).
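
To make that concrete, this is the kind of watcher I have in mind for the raw 
ZooKeeper client. It is only a sketch; Worker and its requestStop() / 
abandonLock() / resumeIfStillOwner() methods are placeholders for my own code, 
not anything ZooKeeper provides:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LockConnectionWatcher implements Watcher {

    // Placeholder for my own long-running task; not part of ZooKeeper.
    public interface Worker {
        void requestStop();         // begin a graceful stop (let the in-flight POST finish)
        void abandonLock();         // assume the lock is gone; hard-stop the task
        void resumeIfStillOwner();  // reconnected within the session, keep going
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Worker worker;
    private volatile ZooKeeper zk;               // set right after the handle is created
    private volatile ScheduledFuture<?> deadline;

    public LockConnectionWatcher(Worker worker) {
        this.worker = worker;
    }

    public void setZooKeeper(ZooKeeper zk) {
        this.zk = zk;
    }

    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case Disconnected:
                // Start the graceful stop; the in-flight REST call may still complete.
                worker.requestStop();
                // If the negotiated session timeout passes without reconnecting,
                // assume the session (and the ephemeral lock node) is gone.
                deadline = scheduler.schedule(worker::abandonLock,
                        zk.getSessionTimeout(), TimeUnit.MILLISECONDS);
                break;
            case SyncConnected:
                // Connected or reconnected within the session: the ephemeral node survived.
                if (deadline != null) {
                    deadline.cancel(false);
                }
                worker.resumeIfStillOwner();
                break;
            case Expired:
                // Only delivered after a successful reconnect; the lock is definitely gone.
                worker.abandonLock();
                break;
            default:
                break;
        }
    }
}

The scheduled deadline uses the negotiated session timeout from the handle, 
i.e. the timer approach you describe below.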

The more I think about it, the harder it seems to get this stuff working 
reliably. What if my node crashes? Then I cannot complete my work normally, so 
whoever takes over my work will try to redo it anyway. Either the receiver is 
made idempotent (which is not always possible) or the new owner needs to be 
aware of the aborted task and be extra cautious, e.g. by checking whether the 
work unit has already been completed. It seems to me that making the “crash” 
case the default (i.e. “crash” the worker thread whenever a disconnected event 
is received) is the best solution, because then I am forced to make the crash 
case robust. I guess that’s what some people call “crash-only design”.
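
Roughly what I mean, again only as a sketch; WorkUnit, WorkStore, the REST 
call and the onDisconnected() wiring are placeholders for my own code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// "Crash-only" variant: every Disconnected is handled as if the process had died.
public class CrashOnlyRunner {

    public interface WorkUnit {
        String id();
    }

    public interface WorkStore {
        boolean isCompleted(String workUnitId);
        void markCompleted(String workUnitId);
    }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final WorkStore store;               // external datastore holding work-unit state
    private volatile Future<?> running;

    public CrashOnlyRunner(WorkStore store) {
        this.store = store;
    }

    // Called after the lock for the unit has been acquired, by the original owner
    // or by whoever takes over after a crash/disconnect.
    public void takeOver(WorkUnit unit) {
        running = executor.submit(() -> {
            // The previous owner may have died anywhere, so first check whether the
            // non-idempotent step already went through.
            if (store.isCompleted(unit.id())) {
                return;                          // nothing left to do
            }
            callRestService(unit);               // the up-to-20s, non-idempotent POST
            store.markCompleted(unit.id());      // record success for the next owner
        });
    }

    // Called from the ZooKeeper watcher on Disconnected: "crash" the worker thread.
    public void onDisconnected() {
        Future<?> f = running;
        if (f != null) {
            f.cancel(true);                      // interrupts the worker thread
        }
    }

    private void callRestService(WorkUnit unit) {
        // placeholder for the actual REST call
    }
}

There is of course still a window between the POST and markCompleted(), so the 
takeover check only narrows the problem; it does not replace an idempotent (or 
at least de-duplicating) receiver.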

Simon



> On 13 Sep 2015, at 03:19 , Jordan Zimmerman <[email protected]> 
> wrote:
> 
> I used to advise that people treat Disconnected the same as session loss as 
> it’s safer. But, you can also set a timer when Disconnected is received and 
> when your session timeout elapses you can then consider session loss (note, 
> use the negotiated value from the ZK handle). FYI - version 3.0.0 of Apache 
> Curator will have an option to choose this alternate method.
> 
> -Jordan
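
Just to check my understanding of the Curator part: I assume the “treat 
Disconnected the same as session loss” advice maps to a ConnectionStateListener 
that handles SUSPENDED like LOST, roughly as in the sketch below (the connect 
string, doWork() and stopWork() are placeholders of mine):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SuspendedAsLostExample {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // placeholder connect string
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                    // Treat a suspended connection the same as session loss:
                    // stop the protected work right away.
                    stopWork();                          // placeholder
                }
            }
        });

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-work-unit");
        lock.acquire();
        try {
            doWork();                                    // placeholder for the long-running task
        } finally {
            lock.release();
        }
    }

    private static void stopWork() { /* placeholder */ }
    private static void doWork()   { /* placeholder */ }
}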
> 
> 
> 
> On September 12, 2015 at 4:47:46 PM, Simon ([email protected]) wrote:
> 
>> Hi 
>> 
>> I am trying to get a better understanding of Zookeeper and how it should be 
>> used. Let’s talk about the lock recipe 
>> (http://zookeeper.apache.org/doc/r3.4.6/recipes.html#sc_recipes_Locks).  
>> 
>> - X acquires the lock 
>> - X does some long-running work (longer than the session timeout) 
>> - X gets partitioned away from the quorum while it is doing that work 
>> - after some time (determined by the timeout passed to ZK) Y will acquire 
>> the lock 
>> 
>> In that situation both X and Y are holding the lock (unless X is acting 
>> properly). If I understand the documentation correctly 
>> (http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#ch_zkSessions),
>>  X would receive a disconnected event in that situation (but not an expired 
>> event unless it successfully reconnects). So X should stop the work it is 
>> doing until it reconnects. How much time does X have to stop? I.e., how long 
>> does it take from the disconnected event being delivered to X until the 
>> ephemeral node used for the lock expires? Having two clients inside a 
>> critical section protected by a lock would not be a good idea. 
>> 
>> Regards, 
>> Simon
