[
https://issues.apache.org/jira/browse/CURATOR-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kezhu Wang closed CURATOR-620.
------------------------------
Resolution: Fixed
> Double Leadership Issue while using Leader Latch Recipe
> -------------------------------------------------------
>
> Key: CURATOR-620
> URL: https://issues.apache.org/jira/browse/CURATOR-620
> Project: Apache Curator
> Issue Type: Bug
> Components: Recipes
> Affects Versions: 2.5.0, 5.2.0
> Environment: Production
> Reporter: Viswanathan Rajagopal
> Priority: Major
>
> While using the Curator Leader Latch recipe in our application, we observed
> a potential issue where two clients became leader at the same time (the
> Double Leadership Issue).
> *Quick summary of the description below*
> * *Our use case explained*
> * *Issue details*
> * *Timeline of events*
> * *Attached test code to reproduce the reported issue*
> * *Possible solution, on which we would like your suggestions*
> +*Our use case:*+
> * Two clients compete for leadership using the Curator Leader Latch
> recipe. A client becomes the leader on LeaderLatchListener.isLeader() and
> gives up its leadership on LeaderLatchListener.notLeader()
> +*Issue details:*+
> * When one of the clients receives two CuratorConnectionListener RECONNECTED
> events in quick succession, its LeaderLatch EventThreads interleave with
> each other. As a result, the "latch node deletion" happens after the "client
> becoming a leader" step, so the client still considers itself a leader even
> though its latch node has been deleted
> * The other client, trying to acquire leadership, creates its latch node,
> sees itself at the first index, and thus becomes a leader as well
> * At this point, both clients are leaders
> +*Timeline of events:*+
> * *Timeline events of Client A*, whose latch node is deleted while it still
> remains a leader
> ** At t1, 1st RECONNECTED event fired
> ** At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership (true
> -> false)
> ** At t3, [ EventThread of 1st RECONNECTED event ] Fires
> “listener.notLeader()”
> ** At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
> ** At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
> ** At t6, 2nd RECONNECTED event fired
> ** At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership (false
> -> false), effectively a no-op
> ** At t8, [ EventThread of 2nd RECONNECTED event ] Fires nothing, again a
> no-op
> ** At t9, [ EventThread of 1st RECONNECTED event ] Gets children -> sorts
> them -> checks leadership -> sets leadership to true -> fires the “Has
> become a leader” listener event
> ** At t10, [ EventThread of 2nd RECONNECTED event ] Deletes the latch node
> (the very node with which Client A became a leader in the previous step)
> * *Timeline events of Client B*, which also becomes a leader
> ** At t11, Client B creates its latch node -> gets children -> sorts them ->
> checks leadership -> sets leadership to true -> fires the “Has become a
> leader” listener event
> This ends up in a situation where both Client A and Client B are leaders.
> As the timeline shows, over the period (t8 -> t10) Client A’s LeaderLatch
> EventThreads interleave with each other, so the latch node is deleted while
> the local state still assumes that the client is a leader.
> +*Reproducing the issue:*+
> * We wrote a JUnit test case that fires artificial Curator RECONNECTED
> connection events and simulates the LeaderLatch EventThread interleaving
> through CountDownLatches
> * *Test simulator for 2.5.0:*
> ** [https://github.com/ViswaNXplore/curator/commit/6a78a3a0de032212175d80caa64f140c743219ae]
> ** [https://github.com/ViswaNXplore/curator/commit/d2b1b33a6885c05619c058aa2bee63962fd6fa08]
> * *Test Simulator for latest Curator version:*
> ** [https://github.com/ViswaNXplore/curator/commit/0949137f7323a1d5f34afc85a7042e8d9e85a8bc]
> ** [https://github.com/ViswaNXplore/curator/commit/1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a]
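
The CountDownLatch technique mentioned in the description can be sketched without Curator or ZooKeeper. What follows is a toy model only, not the attached test code: the class InterleaveDemo, the node names, and the in-memory sorted set standing in for the latch directory are all hypothetical. Three latches force Client A's two event threads through the t2..t10 order of the timeline, and Client B's step t11 follows, exactly as far as the timeline itself goes.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Toy model: none of these names are Curator APIs.
class InterleaveDemo {
    // Stand-in for the latch directory; names sort like sequential nodes.
    static final ConcurrentSkipListSet<String> nodes = new ConcurrentSkipListSet<>();
    static final AtomicBoolean aIsLeader = new AtomicBoolean();
    static final AtomicReference<String> aNode = new AtomicReference<>();

    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Forget Client A's current node and remove it from the directory.
    static void deleteNodeOfA() {
        String old = aNode.getAndSet(null);
        if (old != null) {
            nodes.remove(old);
        }
    }

    /** Replays t2..t11 and returns {A believes it leads, B believes it leads}. */
    static boolean[] replay() {
        nodes.clear();
        nodes.add("latch-0001");
        aNode.set("latch-0001");
        aIsLeader.set(true); // Client A starts out as the leader

        CountDownLatch t5Done = new CountDownLatch(1);
        CountDownLatch t8Done = new CountDownLatch(1);
        CountDownLatch t9Done = new CountDownLatch(1);

        Thread first = new Thread(() -> {
            aIsLeader.set(false);        // t2, t3: leadership true -> false
            deleteNodeOfA();             // t4: delete latch node
            nodes.add("latch-0002");     // t5: create new latch node
            aNode.set("latch-0002");
            t5Done.countDown();
            await(t8Done);               // let the 2nd event run t7 and t8
            if (nodes.first().equals(aNode.get())) {
                aIsLeader.set(true);     // t9: first in sorted order -> leader
            }
            t9Done.countDown();
        });
        Thread second = new Thread(() -> {
            await(t5Done);               // t6: 2nd RECONNECTED event
            aIsLeader.set(false);        // t7, t8: false -> false, no-op
            t8Done.countDown();
            await(t9Done);
            deleteNodeOfA();             // t10: deletes the node A leads with
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        nodes.add("latch-0003");         // t11: Client B creates its node
        boolean bIsLeader = "latch-0003".equals(nodes.first());
        return new boolean[] { aIsLeader.get(), bIsLeader };
    }

    public static void main(String[] args) {
        boolean[] result = replay();
        System.out.println("Client A believes it is leader: " + result[0]
                + ", node: " + aNode.get());
        System.out.println("Client B believes it is leader: " + result[1]);
    }
}
```

In this model both flags end up true while Client A no longer owns a node, which is the double leadership state described above. The latches only pin down one schedule; they say nothing about how likely that schedule is in production.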
> +*Possible Solution (where we would like to hear your thoughts/suggestions):*+
> * During reset(), the current Curator code calls
> ** setLeadership(false) first, followed by
> ** setNode(null) (i.e. deleting its latch node)
> * Swapping these two calls should resolve the issue, because leadership is
> then set to false only after the latch node has been deleted:
> ** setNode(null) (i.e. deleting its latch node) first, followed by
> ** setLeadership(false)
>
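
The effect of the proposed swap can be explored in a small sequential model. This is a sketch under stated assumptions, not Curator code: ResetOrderSketch and staleLeader are hypothetical names, reset() is reduced to its two steps (the node re-creation that follows is omitted), and the other thread's leadership check is treated as one atomic step that sets leadership to true only if the latch node currently exists. The model places that check before, between, or after the two reset steps and reports whether the client ends up believing it leads without owning a node.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: the names below are illustrative, not Curator APIs.
class ResetOrderSketch {
    /**
     * @param deleteNodeFirst true for the proposed order (delete node, then
     *                        clear leadership), false for the reported order
     * @param checkPosition   0, 1 or 2: where the other thread's leadership
     *                        check lands relative to the two reset steps
     * @return true if the client ends as leader without a latch node
     */
    static boolean staleLeader(boolean deleteNodeFirst, int checkPosition) {
        // state[0] = local leadership flag, state[1] = latch node exists.
        // Start after t5 of the timeline: not leader, new node created.
        boolean[] state = { false, true };
        Runnable setLeadershipFalse = () -> state[0] = false;
        Runnable deleteNode = () -> state[1] = false;
        // Assumption: the check becomes leader only if the node still exists.
        Runnable checkLeadership = () -> { if (state[1]) state[0] = true; };

        List<Runnable> steps = new ArrayList<>();
        if (deleteNodeFirst) {
            steps.add(deleteNode);
            steps.add(setLeadershipFalse);
        } else {
            steps.add(setLeadershipFalse);
            steps.add(deleteNode);
        }
        steps.add(checkPosition, checkLeadership);
        for (Runnable step : steps) {
            step.run();
        }
        return state[0] && !state[1];
    }

    public static void main(String[] args) {
        for (boolean deleteNodeFirst : new boolean[] { false, true }) {
            for (int pos = 0; pos <= 2; pos++) {
                System.out.println("deleteNodeFirst=" + deleteNodeFirst
                        + " checkPosition=" + pos
                        + " staleLeader=" + staleLeader(deleteNodeFirst, pos));
            }
        }
    }
}
```

Under these assumptions only the reported order with the check in the middle (the t7 -> t9 -> t10 sequence) leaves a stale leader; the swapped order stays consistent at all three positions. Because the real check is not atomic, this supports the suggestion for the reported interleaving but is not a proof that the swap closes every window.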
--
This message was sent by Atlassian Jira
(v8.20.10#820010)