[ 
https://issues.apache.org/jira/browse/CURATOR-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kezhu Wang closed CURATOR-620.
------------------------------
    Resolution: Fixed

> Double Leadership Issue while using Leader Latch Recipe
> -------------------------------------------------------
>
>                 Key: CURATOR-620
>                 URL: https://issues.apache.org/jira/browse/CURATOR-620
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.5.0, 5.2.0
>         Environment: Production
>            Reporter: Viswanathan Rajagopal
>            Priority: Major
>
> While using the Curator Leader Latch recipe in our application, we observed
> a potential issue where two clients become leader at the same time (double
> leadership).
> *Quick summary of the description below*
>  * *Our use case explained*
>  * *Issue details*
>  * *Timeline of events*
>  * *Attached test code to reproduce the reported issue*
>  * *Possible solution, where we need your suggestions*
> +*Our use case:*+
>  * Two clients try to acquire leadership using the Curator Leader Latch
> recipe. On LeaderLatchListener.isLeader() a client becomes the leader, and on
> LeaderLatchListener.notLeader() it gives up its leadership.
> +*Issue details:*+
>  * When one of the clients received two CuratorConnectionListener RECONNECTED
> events in quick succession, we observed that the LeaderLatch EventThreads
> interleaved with each other, so that "latch node deletion" happened after
> "client becoming a leader". The client therefore remains a leader even
> though its latch node has been deleted.
>  * The other client, trying to get leadership, creates its latch node, sees
> itself at the first index, and thus becomes a leader.
>  * At this point, both clients are leaders.
> +*Timeline of events:*+
>  * *Timeline events of Client A*, whose latch node is deleted but which
> still remains a leader
>  ** At t1, 1st RECONNECTED event fired
>  ** At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership (true 
> -> false)
>  ** At t3, [ EventThread of 1st RECONNECTED event ] Fire 
> “listener.notLeader()”
>  ** At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
>  ** At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
>  ** At t6, 2nd RECONNECTED event fired
>  ** At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership (false 
> -> false), Basically NOP
>  ** At t8, [ EventThread of 2nd RECONNECTED event ] Fire nothing. Basically 
> NOP
>  ** At t9, [ EventThread of 1st RECONNECTED event ] Get children -> sort them 
> -> check leadership -> Set leadership to true -> Fire “Has become a leader” 
> leader listener event
>  ** At t10, [ EventThread of 2nd RECONNECTED event ] Delete latch node (which 
> actually deletes the latch node with which the Client A has become a leader 
> through previous step)
>  * *Timeline events of Client B*, which also becomes a leader
>  ** At t11, Client B creates its latch node -> Get children -> sort them -> 
> check leadership -> Set leadership to true -> Fire “Has become a leader” 
> leader listener event
> This ends up in a situation where both Client A and Client B are leaders.
> As we observe, over the period (t8 -> t10), Client A’s LeaderLatch
> EventThreads interleave with each other, so its latch node is deleted while
> its local state still assumes that it is the leader.
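
The timeline above can be replayed deterministically in a toy model. The sketch below is only an illustration under stated assumptions: ToyLatch, its methods, and the node names are hypothetical stand-ins (not Curator's LeaderLatch), a TreeSet plays the role of the sorted children of the latch path, and the two event threads are flattened into one thread in the t1..t11 order given in the report.

```java
import java.util.TreeSet;

// Toy, single-threaded replay of t1..t11. ToyLatch is a hypothetical stand-in,
// not Curator's LeaderLatch; the TreeSet stands in for the latch path's children.
class DoubleLeaderReplay {
    static class ToyLatch {
        final TreeSet<String> children; // shared by both clients
        boolean leader;
        String node;                    // our latch node, or null if we have none

        ToyLatch(TreeSet<String> children) { this.children = children; }

        void createNode(String name) { node = name; children.add(name); }

        void deleteNode() {
            if (node != null) { children.remove(node); node = null; }
        }

        // "Get children -> sort them -> check leadership": leader if our node sorts first.
        void checkLeadership() {
            leader = node != null && !children.isEmpty() && node.equals(children.first());
        }
    }

    // Returns { A is leader, B is leader } after the replay.
    static boolean[] replay() {
        TreeSet<String> children = new TreeSet<>();
        ToyLatch a = new ToyLatch(children);
        ToyLatch b = new ToyLatch(children);

        a.createNode("latch-0000");
        a.checkLeadership();        // before t1: A holds leadership

        a.leader = false;           // t2, thread 1: true -> false (t3: notLeader() fires)
        a.deleteNode();             // t4, thread 1
        a.createNode("latch-0001"); // t5, thread 1
        a.leader = false;           // t7, thread 2: false -> false, no-op (t8: nothing fired)
        a.checkLeadership();        // t9, thread 1: A becomes leader again
        a.deleteNode();             // t10, thread 2: removes the node A just won with

        b.createNode("latch-0002"); // t11: B is now first among the children
        b.checkLeadership();
        return new boolean[] { a.leader, b.leader };
    }

    public static void main(String[] args) {
        boolean[] result = replay();
        System.out.println("A leader: " + result[0] + ", B leader: " + result[1]);
    }
}
```

In this model both flags end up true, which matches the double leadership described in the report. It does not reproduce the real threading; the linked test simulators do that with CountDownLatches.
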
> +*Reproducing the issue:*+
>  * We wrote a JUnit test case that fires artificial Curator connection
> RECONNECTED events and simulates the LeaderLatch EventThread interleaving
> with CountDownLatches
>  * *Test simulator for 2.5.0:*
>  ** 
> [https://github.com/ViswaNXplore/curator/commit/6a78a3a0de032212175d80caa64f140c743219ae]
>  ** 
> [https://github.com/ViswaNXplore/curator/commit/d2b1b33a6885c05619c058aa2bee63962fd6fa08]
>  * *Test Simulator for latest Curator version:*
>  ** 
> [https://github.com/ViswaNXplore/curator/commit/0949137f7323a1d5f34afc85a7042e8d9e85a8bc]
>  ** 
> [https://github.com/ViswaNXplore/curator/commit/1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a]
> +*Possible Solution (where we would like to hear your thoughts/suggestions):*+
>  * During reset(), the current Curator code does
>  ** setLeadership(false) first followed by
>  ** setNode(null) (i.e. deleting its latch node)
>  * Swapping these two should resolve the issue, since leadership would be set
> to false only after the latch node has been deleted
>  ** setNode(null) (i.e. deleting its latch node) first followed by
>  ** setLeadership(false)
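
The effect of the two orderings can be compared in a small toy model. This is a sketch under assumptions, not Curator code: ToyLatch and endsInconsistent are hypothetical names, setLeadership and setNode only mimic the behaviour described in the report, and the first event thread's leadership check is placed between the two steps of the second thread's reset, as in the timeline.

```java
import java.util.TreeSet;

// Toy model only: ToyLatch is a hypothetical stand-in, not Curator's LeaderLatch.
class ResetOrderSketch {
    static class ToyLatch {
        final TreeSet<String> children = new TreeSet<>(); // stand-in for the latch path's children
        boolean leader;
        String node;

        void setLeadership(boolean value) { leader = value; }

        // Mimics the described setNode(null): drops the current latch node, if any.
        void setNode(String newNode) {
            if (node != null) children.remove(node);
            node = newNode;
            if (newNode != null) children.add(newNode);
        }

        // The first event thread's "get children -> sort -> check leadership" step.
        void checkLeadership() {
            if (node != null && !children.isEmpty() && node.equals(children.first())) {
                setLeadership(true);
            }
        }
    }

    // Replays the second reset with the first thread's leadership check landing
    // between its two steps. Returns true if the latch ends as "leader without a node".
    static boolean endsInconsistent(boolean deleteNodeFirst) {
        ToyLatch a = new ToyLatch();
        a.setNode("latch-0001");    // t5: node created by the first event thread
        if (deleteNodeFirst) {
            a.setNode(null);        // proposed order, step 1
            a.checkLeadership();    // first thread's check lands in between
            a.setLeadership(false); // proposed order, step 2
        } else {
            a.setLeadership(false); // current order, step 1 (t7)
            a.checkLeadership();    // t9
            a.setNode(null);        // current order, step 2 (t10)
        }
        return a.leader && a.node == null;
    }

    public static void main(String[] args) {
        System.out.println("current order inconsistent:  " + endsInconsistent(false));
        System.out.println("proposed order inconsistent: " + endsInconsistent(true));
    }
}
```

In this model only the current order ends with a leader that has no latch node. The model covers just this one interleaving, so it says nothing about whether the swap alone is sufficient in Curator itself.
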
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
