Hello Team,

Greetings!
Is there any update on the observation below?

Many Thanks,
Viswa

From: Viswanathan Rajagopal <viswanathan.rajag...@workday.com.INVALID>
Date: Wednesday, 27 October 2021 at 16:15
To: dev@curator.apache.org <dev@curator.apache.org>, u...@curator.apache.org 
<u...@curator.apache.org>
Subject: [External Sender] Double Leadership Issue
Hello Team,

Greetings!
While using the Curator LeaderLatch recipe in our application, we observed a 
potential issue where two clients became leader at the same time. We have 
raised a Jira for your reference:
https://issues.apache.org/jira/browse/CURATOR-620
Quick summary of the description below:

  *   Our use case explained
  *   Issue details
  *   Timeline of events mentioned
  *   Attached test code to reproduce the reported issue
  *   Possible solution given, where we need your suggestions
Our use case:

  *   Two clients try to acquire leadership using the Curator LeaderLatch 
recipe. On LeaderLatchListener.isLeader() a client becomes the leader, and on 
LeaderLatchListener.notLeader() it gives up its leadership
Issue details:

  *   When one of the clients receives two Curator connection-state RECONNECTED 
events in quick succession, its LeaderLatch EventThreads interleave with each 
other, so that the "latch node deletion" happens after the "client becoming a 
leader" step. The client therefore remains a leader even though its 
corresponding latch node has been deleted
  *   The other client, which is also trying to get leadership, creates its 
latch node, sees itself at the first index and thus becomes a leader
  *   So at this point, both clients are leaders

Timeline of events:

  *   Timeline of events for Client A, whose latch node is deleted while it 
still considers itself the leader
     *   At t1, 1st RECONNECTED event fired
     *   At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership 
(true -> false)
     *   At t3, [ EventThread of 1st RECONNECTED event ] Fires 
“listener.notLeader()”
     *   At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
     *   At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
     *   At t6, 2nd RECONNECTED event fired
     *   At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership 
(false -> false), effectively a no-op
     *   At t8, [ EventThread of 2nd RECONNECTED event ] Fires nothing, 
effectively a no-op
     *   At t9, [ EventThread of 1st RECONNECTED event ] Gets children -> sorts 
them -> checks leadership -> sets leadership to true -> fires the “Has become a 
leader” leader listener event
     *   At t10, [ EventThread of 2nd RECONNECTED event ] Deletes latch node 
(this is the latch node with which Client A became a leader in the previous 
step)

  *   Timeline of events for Client B, which also becomes a leader
     *   At t11, Client B creates its latch node -> gets children -> sorts them 
-> checks leadership -> sets leadership to true -> fires the “Has become a 
leader” leader listener event

This ends up in a situation where both Client A and Client B are leaders at the 
same time.

As we observe, over the period (t8 -> t10), Client A’s LeaderLatch EventThreads 
interleave with each other, so its latch node is deleted while its local 
state still assumes that it is the leader.
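
To make the ordering easier to follow, the timeline above can be replayed step by step in a small single-threaded model. This is only a sketch: the sorted set stands in for the children of the latch path, and the Client class, the node names and the helper methods are hypothetical, not Curator's actual classes or API. Only the steps listed in the timeline are modelled.

```java
import java.util.TreeSet;

class DoubleLeaderTimeline {
    /** Hypothetical stand-in for one latch participant (not Curator's LeaderLatch). */
    static class Client {
        String ourNode;        // latch node this client currently owns, or null
        boolean hasLeadership; // local leadership flag
    }

    // Creates a sequential latch node for the client (assumed naming scheme).
    static void createNode(TreeSet<String> nodes, Client c, int seq) {
        c.ourNode = String.format("latch-%010d", seq);
        nodes.add(c.ourNode);
    }

    // Deletes whatever latch node the client currently owns.
    static void deleteNode(TreeSet<String> nodes, Client c) {
        if (c.ourNode != null) {
            nodes.remove(c.ourNode);
            c.ourNode = null;
        }
    }

    // Get children -> sort -> leader if our node is at the first index.
    static void checkLeadership(TreeSet<String> nodes, Client c) {
        c.hasLeadership = c.ourNode != null
                && !nodes.isEmpty()
                && nodes.first().equals(c.ourNode);
    }

    /** Returns { A is leader, B is leader, A still owns a latch node }. */
    static boolean[] replay() {
        TreeSet<String> nodes = new TreeSet<>();
        Client a = new Client();
        Client b = new Client();

        // Before t1: Client A owns the only latch node and is the leader.
        createNode(nodes, a, 0);
        checkLeadership(nodes, a);

        a.hasLeadership = false;   // t2: 1st event resets leadership (true -> false)
        deleteNode(nodes, a);      // t4: 1st event deletes the latch node
        createNode(nodes, a, 1);   // t5: 1st event creates a new latch node
        a.hasLeadership = false;   // t7: 2nd event resets leadership (no-op)
        checkLeadership(nodes, a); // t9: 1st event finds its node first, sets true
        deleteNode(nodes, a);      // t10: 2nd event deletes the node from t5

        createNode(nodes, b, 2);   // t11: Client B creates its latch node
        checkLeadership(nodes, b); //      and finds itself at the first index

        return new boolean[] { a.hasLeadership, b.hasLeadership, a.ourNode != null };
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println("A leader=" + r[0] + ", B leader=" + r[1]
                + ", A owns node=" + r[2]);
    }
}
```

In this model both flags end up true while Client A no longer owns a latch node, which matches the state described above.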

Reproducing the issue:

  *   We wrote a JUnit test case that fires artificial Curator RECONNECTED 
connection events and simulates the LeaderLatch EventThread interleaving using 
CountDownLatches
  *   Test simulator for 2.5.0:
     *   
https://github.com/ViswaNXplore/curator/commit/6a78a3a0de032212175d80caa64f140c743219ae
     *   
https://github.com/ViswaNXplore/curator/commit/d2b1b33a6885c05619c058aa2bee63962fd6fa08
  *   Test simulator for latest Curator version:
     *   
https://github.com/ViswaNXplore/curator/commit/0949137f7323a1d5f34afc85a7042e8d9e85a8bc
     *   
https://github.com/ViswaNXplore/curator/commit/1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a
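
For readers who do not want to open the commits, the general technique of forcing an interleaving with CountDownLatches can be sketched as follows. This is not the linked test code: it uses plain threads and an in-memory set instead of Curator and ZooKeeper, and the class, node and latch names are made up for illustration. It reproduces only the t7, t9, t10 ordering from the timeline.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class ForcedInterleaving {
    /** Returns true if the client ends up leader without owning a latch node. */
    static boolean run() {
        // Stand-in for the children of the latch path; holds the node from t5.
        ConcurrentSkipListSet<String> nodes = new ConcurrentSkipListSet<>();
        String ourNode = "latch-0000000001"; // hypothetical node name
        nodes.add(ourNode);
        AtomicBoolean leader = new AtomicBoolean(false);

        CountDownLatch secondResetCleared = new CountDownLatch(1);
        CountDownLatch firstThreadChecked = new CountDownLatch(1);

        // Thread for the 1st RECONNECTED event: performs the t9 leadership check.
        Thread first = new Thread(() -> {
            try {
                secondResetCleared.await();                // wait until t7 is done
                leader.set(ourNode.equals(nodes.first())); // t9: check leadership
                firstThreadChecked.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Thread for the 2nd RECONNECTED event: t7 clears the flag, t10 deletes.
        Thread second = new Thread(() -> {
            try {
                leader.set(false);                         // t7: reset leadership
                secondResetCleared.countDown();
                firstThreadChecked.await();                // hold until t9 is done
                nodes.remove(ourNode);                     // t10: delete latch node
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return leader.get() && !nodes.contains(ourNode);
    }

    public static void main(String[] args) {
        System.out.println("leader without latch node: " + run());
    }
}
```

The two latches make the order deterministic, so the problematic state is reached on every run rather than only when the timing happens to line up.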

Possible Solution (where we would like to hear your thoughts/suggestions):

  *   The current Curator code in reset() calls
     *   setLeadership(false) first, followed by
     *   setNode(null) (i.e. deleting its latch node)

  *   Swapping these two should resolve the issue, because leadership would 
then be set to false only after the latch node has been deleted
     *   setNode(null) (i.e. deleting its latch node) first, followed by
     *   setLeadership(false)
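
As a rough way to compare the two orderings, the sketch below places the other thread's leadership check before, between and after the two steps of the second reset, for both the current and the swapped order. It is a simplified in-memory model with hypothetical names, not Curator code. It also ignores everything reset() does after these two steps, so it can only illustrate the idea, not prove that the swap is sufficient.

```java
import java.util.TreeSet;

class ResetOrderSketch {
    /**
     * deleteFirst   = false models the current order (clear flag, then delete),
     *                 true models the proposed order (delete, then clear flag).
     * checkPosition = 0, 1 or 2: the other thread's leadership check runs
     *                 before step 0, before step 1, or after both steps.
     * Returns true if the client ends as leader although its node is gone.
     */
    static boolean staleLeader(boolean deleteFirst, int checkPosition) {
        TreeSet<String> nodes = new TreeSet<>();
        String ourNode = "latch-0000000001"; // node created by the 1st reset
        nodes.add(ourNode);
        boolean leader = false;              // already cleared by the 1st reset

        for (int step = 0; step <= 2; step++) {
            if (step == checkPosition) {
                // Other thread: get children -> sort -> check leadership.
                leader = !nodes.isEmpty() && nodes.first().equals(ourNode);
            }
            if (step == 2) {
                break; // both steps of the 2nd reset have run
            }
            boolean isDelete = (step == 0) == deleteFirst;
            if (isDelete) {
                nodes.remove(ourNode); // setNode(null) in the description above
            } else {
                leader = false;        // setLeadership(false)
            }
        }
        return leader && !nodes.contains(ourNode);
    }

    public static void main(String[] args) {
        for (boolean deleteFirst : new boolean[] { false, true }) {
            for (int pos = 0; pos <= 2; pos++) {
                System.out.println((deleteFirst ? "swapped" : "current")
                        + " order, check at position " + pos
                        + " -> stale leader: " + staleLeader(deleteFirst, pos));
            }
        }
    }
}
```

In this model the stale-leader state appears only with the current order when the check lands between the two steps, and it does not appear for any position with the swapped order.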

Many Thanks,
Viswa
