Hi Jordan,

The dual leadership continues indefinitely in my case.

Many Thanks,
Viswa

From: Jordan Zimmerman <jor...@jordanzimmerman.com>
Date: Wednesday, 3 November 2021 at 08:02
To: dev@curator.apache.org <dev@curator.apache.org>
Cc: u...@curator.apache.org <u...@curator.apache.org>
Subject: [External Sender] Re: Double Leadership Issue
Do I understand this correctly that there are two leaders for a short period of 
time - i.e. it corrects itself eventually? Or does the dual leadership continue 
indefinitely?

-Jordan

> On Nov 2, 2021, at 11:48 AM, Viswanathan Rajagopal 
> <viswanathan.rajag...@workday.com.INVALID> wrote:
>
> Hello Team,
>
> Greetings!
> Any update on the below mentioned observation?
>
> Many Thanks,
> Viswa
>
> From: Viswanathan Rajagopal <viswanathan.rajag...@workday.com.INVALID>
> Date: Wednesday, 27 October 2021 at 16:15
> To: dev@curator.apache.org <dev@curator.apache.org>, u...@curator.apache.org 
> <u...@curator.apache.org>
> Subject: [External Sender] Double Leadership Issue
> Hello Team,
>
> Greetings!
> While using the Curator Leader Latch recipe in our application, we observed a 
> potential issue where two clients become leader at the same time. We raised a 
> Jira on this for your reference (Jira Link : 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CURATOR-2D620&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=3LDys_XJLYEnQ0_K3auTUo8DsOom0xZAMAC7ASgkt0A&e=
>  )
> Quick summary of the description below:
>
>  *   Our use case explained
>  *   Issue details
>  *   Timeline of events mentioned
>  *   Attached test code to reproduce the reported issue
>  *   Possible solution given, where we need your suggestions
> Our use case:
>
>  *   Two clients try to acquire leadership using the Curator Leader Latch 
> recipe. A client becomes leader on LeaderLatchListener.isLeader() and gives 
> up its leadership on LeaderLatchListener.notLeader()
> Issue details:
>
>  *   When one of the clients receives two CuratorConnectionListener 
> RECONNECTED events in quick succession, its LeaderLatch EventThreads 
> interleave with each other, so that "latch node deletion" happens after 
> "client becoming a leader". The client therefore remains a leader even 
> though its corresponding latch node has been deleted
>  *   The other client, trying to get leadership, creates its latch node, 
> sees itself at the first index, and thus becomes a leader
>  *   So at this point, both clients are leaders
>
> Timeline of events:
>
>  *   Timeline of events for Client A, whose latch node is deleted while it 
> still considers itself a leader
>     *   At t1, 1st RECONNECTED event fired
>     *   At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership 
> (true -> false)
>     *   At t3, [ EventThread of 1st RECONNECTED event ] Fire 
> “listener.notLeader()”
>     *   At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
>     *   At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
>     *   At t6, 2nd RECONNECTED event fired
>     *   At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership 
> (false -> false), Basically NOP
>     *   At t8, [ EventThread of 2nd RECONNECTED event ] Fire nothing. 
> Basically NOP
>     *   At t9, [ EventThread of 1st RECONNECTED event ] Get children -> sort 
> them -> check leadership -> Set leadership to true -> Fire “Has become a 
> leader” leader listener event
>     *   At t10, [ EventThread of 2nd RECONNECTED event ] Delete latch node 
> (which actually deletes the latch node with which the Client A has become a 
> leader through previous step)
>
>  *   Timeline of events for Client B, which also becomes a leader
>     *   At t11, Client B creates its latch node -> Get children -> sort them 
> -> check leadership -> Set leadership to true -> Fire “Has become a leader” 
> leader listener event
>
> This ends up in a situation where both Client A and Client B have become a 
> leader
>
> As we observe, over the period (t8 -> t10), Client A’s LeaderLatch 
> EventThreads interleave with each other, so the leadership latch node is 
> deleted while the local state still assumes that it’s a leader
>
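The timeline above can be replayed step by step in a single thread to see the resulting state. The following is only a minimal sketch under stated assumptions: the class, field, and node names are hypothetical, the latch parent is modelled as an in-memory sorted set rather than ZooKeeper, and none of this is Curator's LeaderLatch code.

```java
import java.util.TreeSet;

// Hypothetical, simplified replay of steps t1..t11; not Curator code.
class LatchTimelineReplay {
    // Stand-in for the sorted children of the latch parent node.
    final TreeSet<String> children = new TreeSet<>();
    boolean clientALeader = true;   // Client A holds leadership before t1
    boolean clientBLeader = false;
    String clientANode = "latch-0000000001";
    int nextSeq = 2;

    // Mimics creating a sequential node: each new name sorts after older ones.
    String createNode() {
        String name = String.format("latch-%010d", nextSeq++);
        children.add(name);
        return name;
    }

    void replay() {
        children.add(clientANode);
        // t2, t3: first event thread resets leadership (true -> false)
        clientALeader = false;
        // t4, t5: first event thread deletes the old node, creates a new one
        children.remove(clientANode);
        clientANode = createNode();
        // t7, t8: second event thread resets leadership (false -> false)
        clientALeader = false;
        // t9: first event thread lists children and finds its node first
        if (children.first().equals(clientANode)) {
            clientALeader = true;
        }
        // t10: second event thread deletes Client A's current latch node
        children.remove(clientANode);
        // t11: Client B creates its node and finds it first
        String clientBNode = createNode();
        if (children.first().equals(clientBNode)) {
            clientBLeader = true;
        }
    }

    public static void main(String[] args) {
        LatchTimelineReplay r = new LatchTimelineReplay();
        r.replay();
        System.out.println("A leader=" + r.clientALeader
                + ", B leader=" + r.clientBLeader
                + ", children=" + r.children);
    }
}
```

In this model both flags end up true while only Client B's node remains, which matches the state described in the timeline.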
> Reproducing the issue:
>
>  *   Wrote a JUnit test case that fires artificial Curator connection 
> RECONNECTED events and simulates the LeaderLatch EventThreads interleaving 
> through CountDownLatches
>  *   Test simulator for 2.5.0:
>     *   
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_6a78a3a0de032212175d80caa64f140c743219ae&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=tveG7d6kAd8SeywmuCN7zyd1ufTvARJdEEc0gxTs2rU&e=
>     *   
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_d2b1b33a6885c05619c058aa2bee63962fd6fa08&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=jixCmfLZiaseXsSWihiUiYMw8cj5cDg1O6gLFJY3kKg&e=
>  *   Test Simulator for latest Curator version:
>     *   
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_0949137f7323a1d5f34afc85a7042e8d9e85a8bc&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=bzLny0aqbqUHmvLwkWyLdIySm65swqv2rAT1Kn0MKJ0&e=
>     *   
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=GTlqqRRRB_P5y_f1tRSRxv1HZvVjhwFHtlogEk47LAU&e=
>
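For readers who cannot open the linked commits, the general technique of forcing a specific interleaving with CountDownLatches might look like the sketch below. It is not the linked test code: the class name, latch names, and log messages are hypothetical, and the threads only record which step would run rather than calling Curator.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: pins two threads to the step order t5 -> t7 -> t9 -> t10.
class ForcedInterleaving {
    // Waits on a latch, converting interruption into an unchecked exception.
    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static List<String> run() {
        List<String> log = new CopyOnWriteArrayList<>();
        CountDownLatch nodeCreated = new CountDownLatch(1);
        CountDownLatch secondReset = new CountDownLatch(1);
        CountDownLatch becameLeader = new CountDownLatch(1);

        Thread first = new Thread(() -> {
            log.add("first: reset leadership, delete node, create node");
            nodeCreated.countDown();
            await(secondReset);          // hold until the second thread's no-op reset
            log.add("first: check leadership, become leader");
            becameLeader.countDown();
        });
        Thread second = new Thread(() -> {
            await(nodeCreated);          // start only after the new node exists
            log.add("second: reset leadership (no-op)");
            secondReset.countDown();
            await(becameLeader);         // delete only after the first thread leads
            log.add("second: delete latch node");
        });

        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Each latch is released by exactly one step, so the four log entries always appear in the same order regardless of thread scheduling.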
> Possible Solution (where we would like to hear your thoughts/suggestions):
>
>  *   The current curator code during reset() does
>     *   setLeadership(false) first followed by
>     *   setNode(null) (i.e. deleting its latch node)
>
>  *   Swapping these two should resolve the issue, as leadership would be set 
> to false only after the latch node has been deleted
>     *   setNode(null) (i.e. deleting its latch node) first followed by
>     *   setLeadership(false)
>
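The effect of the proposed swap on the last step of the timeline can be illustrated with a small model. This is a sketch under assumptions, not Curator code: the class and method names are hypothetical, and it assumes the second thread's node deletion is the step that lands after t9. It covers only this one interleaving and does not show that the swap is safe under every interleaving.

```java
// Hypothetical, simplified model of the two reset() orderings; not Curator code.
class ResetOrderSketch {
    boolean hasLeadership;
    boolean nodeExists;

    // State right after t9: the first thread has just set leadership to true.
    static ResetOrderSketch afterT9() {
        ResetOrderSketch s = new ResetOrderSketch();
        s.hasLeadership = true;
        s.nodeExists = true;
        return s;
    }

    // Runs the part of the second thread's reset() that is still pending.
    void finishSecondReset(boolean deleteNodeFirst) {
        if (deleteNodeFirst) {
            // Proposed order: setNode(null), then setLeadership(false)
            nodeExists = false;
            hasLeadership = false;
        } else {
            // Current order: setLeadership(false) already ran at t7,
            // so only setNode(null) is left to run at t10
            nodeExists = false;
        }
    }

    // A client should not believe it is leader without a latch node.
    boolean consistent() {
        return !hasLeadership || nodeExists;
    }

    public static void main(String[] args) {
        ResetOrderSketch current = afterT9();
        current.finishSecondReset(false);
        ResetOrderSketch swapped = afterT9();
        swapped.finishSecondReset(true);
        System.out.println("current order consistent: " + current.consistent());
        System.out.println("swapped order consistent: " + swapped.consistent());
    }
}
```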
> Many Thanks,
> Viswa
