tisonkun opened a new pull request, #430: URL: https://github.com/apache/curator/pull/430
See also:

* https://issues.apache.org/jira/browse/CURATOR-644
* https://issues.apache.org/jira/browse/CURATOR-645

## Livelock in detail

There are two race conditions here that can cause a livelock.

**Case 1.** Suppose there are two participants, p0 and p1:

* T0. p1 is about to watch the preceding node, which belongs to p0.
* T1. p0 gets reconnected, resets its node, and creates a new node, preparing to watch p1's node.
* T2. p1 finds the preceding node has gone, and resets itself.

At this point, p0 and p1 can fall into a livelock in which they never see each other's node and keep resetting themselves forever. This is the case reported by CURATOR-645.

**Case 2.** A similar case can happen even with only one participant, p:

* T0. In thread 0 (th0), p is about to `checkLeadership`, before it reads `ourPath.get()`.
* T1. In thread 1 (th1), p gets reconnected and calls `reset`; now `ourPath.get() == null`.
* T2. th0 reads `ourPath.get() == null` and is going to `reset`.
* T3. th1 creates its new node and prepares to read `ourPath.get()`.
* T4. th0 calls `reset()`.

At this point, two threads inside the same participant compete with each other and end up in a livelock. This is the case reported by CURATOR-644.

## Solution

I make two significant changes to resolve these livelock cases:

1. Call `getChildren` instead of `reset` when the preceding node is not found in the callback. This was previously reported in https://github.com/apache/curator/commit/ff4ec29f5958cc9162f0302c02f4ec50c0e796cd#r31770630. I don't see a reason to behave differently in the callback and the watcher for the same condition, and concurrent `reset`s are what trigger these livelocks.
2. Call `getChildren` instead of `reset` when recovering from a connection loss. The reason is similar to 1: if a connection loss or session expiry causes our node to be deleted, `checkLeadership` will see that condition and call `reset`.

These changes should fix CURATOR-645 and alleviate the case in CURATOR-644. However, as long as concurrent `checkLeadership` calls can be generated, a participant can still race with itself. I considered using a `checkLeadershipLock` here, but since all client requests are handled in callbacks, such a lock would protect little.

I'm working on adding test cases, and changes like these need more eyes. Also, if you have an idea for fixing the single-participant multi-thread race condition, please comment.
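To make change 1 concrete, here is a rough sketch of the idea, not the actual `LeaderLatch` code: the class, the `watchPredecessor` helper, and the `latchPath` field are made up for illustration. The point is that both the watcher and the background callback react to a missing preceding node by re-listing the children rather than calling `reset`:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;

// Illustration only -- not LeaderLatch itself. Shows the "getChildren instead
// of reset" behavior when the preceding participant's node is missing.
class PrecedingNodeCheckSketch
{
    private final CuratorFramework client;
    private final String latchPath;

    PrecedingNodeCheckSketch(CuratorFramework client, String latchPath)
    {
        this.client = client;
        this.latchPath = latchPath;
    }

    // Watch the preceding participant's node. Whether the node is already gone
    // (NONODE in the background callback) or deleted later (NodeDeleted in the
    // watcher), both paths re-list the children instead of calling reset().
    void watchPredecessor(String predecessorPath) throws Exception
    {
        Watcher watcher = event -> {
            if ( event.getType() == Watcher.Event.EventType.NodeDeleted )
            {
                getChildrenAndCheck();
            }
        };

        BackgroundCallback callback = (c, event) -> {
            if ( event.getResultCode() == KeeperException.Code.NONODE.intValue() )
            {
                // previously this branch would reset(); re-listing the children
                // keeps our own node in place and avoids the mutual-reset livelock
                getChildrenAndCheck();
            }
        };

        client.getData().usingWatcher(watcher).inBackground(callback).forPath(predecessorPath);
    }

    private void getChildrenAndCheck()
    {
        try
        {
            client.getChildren().inBackground((c, event) -> {
                // checkLeadership(event.getChildren()) would be invoked here
            }).forPath(latchPath);
        }
        catch ( Exception e )
        {
            // error handling elided in this sketch
        }
    }
}
```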