[jira] [Comment Edited] (CURATOR-498) LeaderLatch deletes leader and leaves it hung besides a second leader

Shay Shimony (JIRA) Mon, 31 Dec 2018 09:13:59 -0800


    [ 
https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731380#comment-16731380
 ]


Shay Shimony edited comment on CURATOR-498 at 12/31/18 5:06 PM:
----------------------------------------------------------------

That makes sense. Thanks for your clear explanation. But I don't understand 
something regarding the following line:

2018-12-28 16:33:17 INFO  LeaderLatch3:679 - CHANGED PATH: 
/sites/prod/leader-latch/_c_d99020be-d50e-4f16-946d-e8dd6ba34a5b-latch-*0000000085*
  =>  
/sites/prod/leader-latch/_c_0e156808-69bf-4932-b0fd-12d0fb407365-latch-*0000000083*

Doesn't it mean that setNode(event.getName()); was called with 
event.getName() == _c_0e156808-69bf-4932-b0fd-12d0fb407365-latch-0000000083 ?

If that's the case, then 0000000083 was not simply returned in the list of 
children of leader-latch; this is the callback that fires after creating 
0000000083, no?

In fact, if 0000000083 had simply been returned as a child, the 
checkLeadership method would have denied leadership (in a good way) unless 
0000000083 was in ourPath. So 0000000083 must have been set there earlier by 
the callback, because we know leadership succeeded for it: 
"LEADER WITH ... 0000000083".
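
To make that argument concrete, here is a minimal plain-Java sketch of the 
ordering check as I understand it. It is not the Curator source: the class 
name LeadershipCheckSketch and its helper methods are hypothetical, and it 
only models "sort the children by sequence number and see whether our own 
node is first". The node names are the ones from the log line above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified model of the leadership check; not Curator code.
public class LeadershipCheckSketch {
    // Extracts the numeric sequence suffix that ZooKeeper appends to sequential nodes.
    static long sequenceOf(String nodeName) {
        return Long.parseLong(nodeName.substring(nodeName.lastIndexOf('-') + 1));
    }

    // Position of our node among the children ordered by sequence, or -1 if absent.
    static int ourIndex(List<String> children, String ourNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(LeadershipCheckSketch::sequenceOf));
        return ourNode == null ? -1 : sorted.indexOf(ourNode);
    }

    // We are leader only if our own node has the lowest sequence number.
    static boolean isLeader(List<String> children, String ourNode) {
        return ourIndex(children, ourNode) == 0;
    }

    public static void main(String[] args) {
        String stale = "_c_0e156808-69bf-4932-b0fd-12d0fb407365-latch-0000000083";
        String current = "_c_d99020be-d50e-4f16-946d-e8dd6ba34a5b-latch-0000000085";
        List<String> children = Arrays.asList(current, stale);

        // ourPath still points at 0000000085: the lower 0000000083 blocks leadership.
        System.out.println(isLeader(children, current)); // prints false
        // ourPath overwritten with 0000000083: the same child list now grants leadership.
        System.out.println(isLeader(children, stale));   // prints true
    }
}
```

So under this model the child list alone cannot make the client a leader; 
only the change of ourPath to 0000000083 can.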

And we know that 0000000083 was created by the previous, dead session. So I 
don't understand how the callback can be invoked for this dead session's 
create command.

At 16:33:14, we saw "ZooKeeper:693 - Session: 0x1000ae5465c0007 closed"

At 16:33:17, we saw the message above with "CHANGED PATH ... 0000000083"

So, I think we care less that 0000000083 is temporarily present in the ZK 
follower, even as the LeaderLatch leader, because LeaderLatch knows how to 
recover from that as long as it is not in ourPath (upon session expiration, 
ZK deletes it and that triggers re-election in LeaderLatch). We care more 
that ourPath was set to 0000000083 and 0000000085 was deleted, because 
0000000083 does not belong to that LeaderLatch's current active session (in 
fact, it belongs to no one's current session), and from that we don't recover.
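
As an illustration of the failure mode, and of one possible guard against it, 
here is a small single-threaded Java simulation. Everything in it is an 
assumption made for the sketch: the class StaleCallbackSketch, the epoch 
counter bumped on reset(), and the two onCreated... methods are hypothetical 
and are not Curator APIs. setNode only mimics the behaviour described in this 
thread (swap in the new path, delete the previous one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical simulation of a stale create() callback; not Curator code.
public class StaleCallbackSketch {
    private final AtomicReference<String> ourPath = new AtomicReference<>();
    private final List<String> deleted = new ArrayList<>();
    private long currentEpoch = 0; // assumed counter, incremented on every reset()

    // Mimics the described setNode: replace ourPath and delete the old node.
    void setNode(String newPath) {
        String oldPath = ourPath.getAndSet(newPath);
        if (oldPath != null) {
            deleted.add(oldPath);
        }
    }

    // Each reset starts a new create attempt and returns the epoch it belongs to.
    long reset() {
        return ++currentEpoch;
    }

    // Unguarded: any create callback, however old, overwrites ourPath.
    void onCreatedUnguarded(String path) {
        setNode(path);
    }

    // Guarded: a callback from an earlier epoch is ignored.
    boolean onCreatedGuarded(long epoch, String path) {
        if (epoch != currentEpoch) {
            return false;
        }
        setNode(path);
        return true;
    }

    String ourPath() { return ourPath.get(); }
    List<String> deleted() { return deleted; }

    public static void main(String[] args) {
        StaleCallbackSketch latch = new StaleCallbackSketch();
        long oldEpoch = latch.reset();   // create of ...0000000083 is issued
        long newEpoch = latch.reset();   // session lost, create of ...0000000085 is issued
        latch.onCreatedGuarded(newEpoch, "latch-0000000085");
        boolean applied = latch.onCreatedGuarded(oldEpoch, "latch-0000000083");
        System.out.println(applied + " " + latch.ourPath() + " " + latch.deleted());
    }
}
```

With the unguarded method the late callback for 0000000083 would replace 
ourPath and put 0000000085 on the deleted list, which is the sequence the 
"CHANGED PATH" log line suggests. With the guard the late callback is dropped 
and ourPath stays at 0000000085. Whether an epoch check, or cancelling 
callbacks on SUSPENDED, is the right fix for LeaderLatch itself is a separate 
question.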


> LeaderLatch deletes leader and leaves it hung besides a second leader
> ---------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly 
> 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Major
>         Attachments: CURATOR-498.png, HaWatcher.log, LeaderLatch0.java, 
> ha.tar.gz, logs.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out 
> of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a 
> while and then restore it, then after Curator in my app restores its 
> connection to ZK, sometimes not all 6 clients are found in the latch 
> path (using zkCli.sh). That is, I have 5 instead of 6.
> After investigating a little, I have a suspicion that LeaderLatch deleted the 
> leader in method setNode.
> To investigate it I copied the LeaderLatch code and added some log messages, 
> and from them it seems like a very old create() background callback was 
> surprisingly scheduled and corrupted the current leader with its stale path 
> name. Meaning, this old callback called setNode with its stale name, set 
> itself in place of the leader, and deleted the leader. This leaves the 
> client running, thinking it is the leader, while another leader is selected.
> If my analysis is correct then it seems like we need to cancel this obsolete 
> create callback (I think its session was suspended at 22:38:54 and 
> then lost at 22:39:04, so on SUSPENDED, cancel ongoing callbacks).
> Please see attached log file and modified LeaderLatch0.
>  
> In the log, note that at 22:39:26 it shows that 0000000485 is replaced by 
> 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds before, we can see that it was in the 
> reset() method ("RESET OUR PATH") and possibly triggered the creation of 
> 0000000480 then.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
