tisonkun opened a new pull request, #430: URL: https://github.com/apache/curator/pull/430
See also:

* https://issues.apache.org/jira/browse/CURATOR-644
* https://issues.apache.org/jira/browse/CURATOR-645

## Livelock in detail

There are two race conditions here that can cause a livelock.

**Case 1.** Suppose there are two participants, p0 and p1:

* T0. p1 is about to watch the preceding node, which belongs to p0.
* T1. p0 gets reconnected, resets its node, and creates a new node, preparing to watch p1's node.
* T2. p1 finds the preceding node has gone, and resets itself.

At this point, p0 and p1 can fall into a livelock in which they never see each other's node and keep resetting themselves forever. This is the case reported by CURATOR-645.

**Case 2.** A similar case can happen even with only one participant, p:

* T0. In thread 0 (th0), p is about to `checkLeadership`, before it reads `ourPath.get()`.
* T1. In thread 1 (th1), p gets reconnected and calls `reset`; now `ourPath.get() == null`.
* T2. th0 reads `ourPath.get() == null` and is going to `reset`.
* T3. th1 creates its new node and prepares to read `ourPath.get()`.
* T4. th0 calls `reset()`.

At this point, two threads inside the same participant compete with each other and end up in a livelock. This is the case reported by CURATOR-644.

## Solution

I make two significant changes to resolve these livelock cases:

1. Call `getChildren` instead of `reset` when the preceding node is not found in the callback. This was previously reported in https://github.com/apache/curator/commit/ff4ec29f5958cc9162f0302c02f4ec50c0e796cd#r31770630. I don't see a reason to behave differently in the callback and the watcher for the same condition, and concurrent `reset`s are what trigger these livelocks.
2. Call `getChildren` instead of `reset` when recovering from a connection loss. The reason is similar to 1: if a connection loss or session expiry causes our node to be deleted, `checkLeadership` will see that condition and call `reset`.

These changes should fix CURATOR-645 and alleviate the case in CURATOR-644. However, as long as concurrent `checkLeadership` calls can be generated, a participant can still race with itself. I considered using a `checkLeadershipLock` here, but since all client requests are handled in callbacks, such a lock would protect little.

I'm working on adding test cases, and changes like these need more eyes. Also, if you have an idea for fixing the single-participant multi-thread race condition, please comment.
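To make change 1 concrete, here is a rough sketch of the idea, not the actual `LeaderLatch` code: the class, the `watchPredecessor` helper, and the `latchPath` field are made up for illustration. The point is that both the watcher and the background callback react to a missing preceding node by re-listing the children rather than calling `reset`:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;

// Illustration only -- not LeaderLatch itself. Shows the "getChildren instead
// of reset" behavior when the preceding participant's node is missing.
class PrecedingNodeCheckSketch
{
    private final CuratorFramework client;
    private final String latchPath;

    PrecedingNodeCheckSketch(CuratorFramework client, String latchPath)
    {
        this.client = client;
        this.latchPath = latchPath;
    }

    // Watch the preceding participant's node. Whether the node is already gone
    // (NONODE in the background callback) or deleted later (NodeDeleted in the
    // watcher), both paths re-list the children instead of calling reset().
    void watchPredecessor(String predecessorPath) throws Exception
    {
        Watcher watcher = event -> {
            if ( event.getType() == Watcher.Event.EventType.NodeDeleted )
            {
                getChildrenAndCheck();
            }
        };

        BackgroundCallback callback = (c, event) -> {
            if ( event.getResultCode() == KeeperException.Code.NONODE.intValue() )
            {
                // previously this branch would reset(); re-listing the children
                // keeps our own node in place and avoids the mutual-reset livelock
                getChildrenAndCheck();
            }
        };

        client.getData().usingWatcher(watcher).inBackground(callback).forPath(predecessorPath);
    }

    private void getChildrenAndCheck()
    {
        try
        {
            client.getChildren().inBackground((c, event) -> {
                // checkLeadership(event.getChildren()) would be invoked here
            }).forPath(latchPath);
        }
        catch ( Exception e )
        {
            // error handling elided in this sketch
        }
    }
}
```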