Hey Viswa,

Sorry, cut-and-paste error on my part. This PR here:
https://github.com/apache/curator/pull/398

Looks like it may be fixing at least a similar problem. I'll try and take a look in more detail when I get a minute, but my time for Curator is currently very limited.

cheers

On Wed, Nov 3, 2021 at 2:25 PM Viswanathan Rajagopal <viswanathan.rajag...@workday.com> wrote:

> Hi Cam,
>
> Thanks for getting back
>
> Yes, that was me who had opened Curator Jira. I had raised Curator Jira initially, but since there were no responses, thought to open a conversation on the same.
>
> I have also referenced this Jira link in my original conversation below
>
> Many Thanks,
> Viswa
>
> *From: *Cameron McKenzie <cammcken...@apache.org>
> *Date: *Tuesday, 2 November 2021 at 21:25
> *To: *dev@curator.apache.org <dev@curator.apache.org>
> *Cc: *u...@curator.apache.org <u...@curator.apache.org>
> *Subject: *[External Sender] Re: Double Leadership Issue
>
> hey Viswa,
> I haven't had a chance to look at it in any detail yet, but superficially it sounds like it has some similarities to this PR?
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CURATOR-2D620&d=DwIFaQ&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjVR9Zfxevy2QbIqXzpZz32m&m=4quNit2CApic0UneDxdldPSbKfjBRrFPluHQspXgUQt1HBy_V319jPgxWrKsYi76&s=DfAE8YU4ITE_OOcDW9R_uI5yK3Z-zDSl1gXGpsLiK9Y&e=
>
> cheers
> Cam
>
> On Tue, Nov 2, 2021 at 10:48 PM Viswanathan Rajagopal <viswanathan.rajag...@workday.com.invalid> wrote:
>
> > Hello Team,
> >
> > Greetings!
> > Any update on the below mentioned observation?
> >
> > Many Thanks,
> > Viswa
> >
> > From: Viswanathan Rajagopal <viswanathan.rajag...@workday.com.INVALID>
> > Date: Wednesday, 27 October 2021 at 16:15
> > To: dev@curator.apache.org <dev@curator.apache.org>, u...@curator.apache.org <u...@curator.apache.org>
> > Subject: [External Sender] Double Leadership Issue
> >
> > Hello Team,
> >
> > Greetings!
> >
> > While using Curator Leader Latch Recipe in our application, we observed a potential issue where two clients have become a leader. Raised a Jira on the same for your reference (Jira Link : https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CURATOR-2D620&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=3LDys_XJLYEnQ0_K3auTUo8DsOom0xZAMAC7ASgkt0A&e= )
> >
> > Quick summary of below description
> >
> > * Our use case explained
> > * Issue details
> > * Timeline of events mentioned
> > * Attached test code to reproduce the reported issue
> > * Possible solution given, where we need your suggestions
> >
> > Our use case:
> >
> > * Two clients trying to get the leadership using Curator Leader Latch Recipe. On LeaderLatchListener.isLeader() Client would become a leader and on LeaderLatchListener.notLeader() Client would lose its leadership
> >
> > Issue details:
> >
> > * One of the clients on receiving two CuratorConnectionListener RECONNECTED events in quick succession, we observed that LeaderLatch EventThreads interleave with each other, resulting in "latch node deletion" happen after "client becoming a leader", thereby the client will still be a leader though its corresponding latch node has been deleted
> > * And the other client who tried to get leadership creates its latch node and sees itself in first index and thus become a leader
> > * So at this point, two clients have become a leader
> >
> > Timeline of events:
> >
> > * Timeline events of Client A whose corresponding latch node is deleted but still be a leader
> >     * At t1, 1st RECONNECTED event fired
> >     * At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership (true -> false)
> >     * At t3, [ EventThread of 1st RECONNECTED event ] Fire “listener.notLeader()”
> >     * At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
> >     * At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
> >     * At t6, 2nd RECONNECTED event fired
> >     * At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership (false -> false), Basically NOP
> >     * At t8, [ EventThread of 2nd RECONNECTED event ] Fire nothing. Basically NOP
> >     * At t9, [ EventThread of 1st RECONNECTED event ] Get children -> sort them -> check leadership -> Set leadership to true -> Fire “Has become a leader” leader listener event
> >     * At t10, [ EventThread of 2nd RECONNECTED event ] Delete latch node (which actually deletes the latch node with which the Client A has become a leader through previous step)
> >
> > * Timeline events of Client B who also become a leader
> >     * At t11, Client B creates its latch node -> Get children -> sort them -> check leadership -> Set leadership to true -> Fire “Has become a leader” leader listener event
> >
> > This ends up in a situation where both Client A and Client B have become a leader
> >
> > As we observe, over the period (t8 -> t10), Client A’s LeaderLatch EventThreads interleave with each other causing leadership latch node deleted but local state still assumes that it’s a leader
> >
> > Reproducing the issue:
> >
> > * Wrote a Junit test case firing an artificial curator connection reconnected events and simulated LeaderLatch EventThreads interleave through CountDownLatches
> > * Test simulator for 2.5.0:
> >     * https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_6a78a3a0de032212175d80caa64f140c743219ae&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=tveG7d6kAd8SeywmuCN7zyd1ufTvARJdEEc0gxTs2rU&e=
> >     * https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_d2b1b33a6885c05619c058aa2bee63962fd6fa08&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=jixCmfLZiaseXsSWihiUiYMw8cj5cDg1O6gLFJY3kKg&e=
> > * Test Simulator for latest Curator version:
> >     * https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_0949137f7323a1d5f34afc85a7042e8d9e85a8bc&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=bzLny0aqbqUHmvLwkWyLdIySm65swqv2rAT1Kn0MKJ0&e=
> >     * https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ViswaNXplore_curator_commit_1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a&d=DwIF-g&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=mwSLlPO0Vtstmu1dce0TMFqf5lUxD2SPdNdc1k4NXjU&m=nFB4puWyvVe8eRiZ3oi_C8Ao1WkqCb9wsonPrIl3LY8&s=GTlqqRRRB_P5y_f1tRSRxv1HZvVjhwFHtlogEk47LAU&e=
> >
> > Possible Solution (where we would like to hear your thoughts/suggestions):
> >
> > * The current curator code during reset() does
> >     * setLeadership(false) first followed by
> >     * setNode(null) (i.e. deleting its latch node)
> >
> > * Swapping these two should resolve the issue, as we setting leadership to false once after its latch node gets deleted
> >     * setNode(null) (i.e. deleting its latch node) first followed by
> >     * setLeadership(false)
> >
> > Many Thanks,
> > Viswa
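
For anyone following the thread, the interleaving in the quoted timeline (t1 to t11) can be replayed with a small single-threaded model. This is only a rough sketch and not Curator code: `Registry`, `Client`, and their methods are hypothetical stand-ins for the latch parent node and the latch's local state, and the two EventThreads are simulated by running their steps in the reported order rather than with real threads.

```java
import java.util.TreeSet;

public class DoubleLeaderSketch {

    /** Toy stand-in for the latch parent node: sorted names of its children. */
    static class Registry {
        final TreeSet<String> children = new TreeSet<>();
        int sequence = 0;
    }

    /** Toy stand-in for one latch participant (hypothetical, not the Curator API). */
    static class Client {
        final Registry registry;
        boolean hasLeadership;
        String ourPath; // latch node this client believes it owns, or null

        Client(Registry registry) {
            this.registry = registry;
        }

        void createNode() {
            ourPath = String.format("latch-%010d", registry.sequence++);
            registry.children.add(ourPath);
        }

        void deleteNode() {
            if (ourPath != null) {
                registry.children.remove(ourPath);
            }
            ourPath = null;
        }

        // Leader if our node sorts first among the children.
        void checkLeadership() {
            hasLeadership = !registry.children.isEmpty()
                    && registry.children.first().equals(ourPath);
        }
    }

    /** Replays t1..t11; returns {A is leader, B is leader, A still owns a node}. */
    static boolean[] replayTimeline() {
        Registry registry = new Registry();
        Client a = new Client(registry);
        Client b = new Client(registry);

        // Starting point (assumed): A holds the only latch node and is leader.
        a.createNode();
        a.checkLeadership();

        a.hasLeadership = false; // t2: 1st event resets leadership (true -> false)
                                 // t3: listener.notLeader() would fire here
        a.deleteNode();          // t4: 1st event deletes the latch node
        a.createNode();          // t5: 1st event creates a new latch node
        a.hasLeadership = false; // t6-t8: 2nd event resets leadership, a no-op
        a.checkLeadership();     // t9: 1st event sees A first in the list, sets leader
        a.deleteNode();          // t10: 2nd event deletes the node; flag is untouched
        b.createNode();          // t11: B creates its node and checks leadership
        b.checkLeadership();

        return new boolean[] { a.hasLeadership, b.hasLeadership, a.ourPath != null };
    }

    public static void main(String[] args) {
        boolean[] r = replayTimeline();
        System.out.println("A leader=" + r[0] + ", B leader=" + r[1]
                + ", A owns node=" + r[2]);
    }
}
```

In this model both flags end up true while A no longer owns a node, which matches the reported symptom. It says nothing about whether swapping the two calls in reset() is sufficient in the real recipe; that would need checking against the actual LeaderLatch code and the linked test simulators.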