[ https://issues.apache.org/jira/browse/CURATOR-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566107#comment-17566107 ]
Zili Chen edited comment on CURATOR-645 at 7/13/22 2:29 AM:
------------------------------------------------------------
I push a patch at [https://github.com/apache/curator/pull/430]. It should fix this issue.

was (Author: tison):
I push a patch at [https://github.com/apache/curator/pull/430.] It should fix this issue.

> LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership
> ------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-645
>                 URL: https://issues.apache.org/jira/browse/CURATOR-645
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.2.0
>            Reporter: Matthias Pohl
>            Priority: Major
>
> We experienced a strange behavior of the LeaderLatch in a test case in Apache Flink (see FLINK-28078) where two {{LeaderLatch}} instances are competing for the leadership, resulting in an infinite loop.
> The test includes three instances of a wrapper class that has a {{LeaderLatch}} as a member. This is about [ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperMultipleComponentLeaderElectionDriverTest.java#L236].
> In the test, the first {{LeaderLatch}} acquires the leadership, which results in the {{LeaderLatch}} being closed and, as a consequence, losing the leadership. The odd thing now is that the two left-over {{LeaderLatch}} instances end up in an infinite loop, as shown in the ZooKeeper server logs:
> {code:java}
> 16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
> 16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
> 16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
> 16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
> 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
> 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
> 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
> {code}
> It looks like the close call of the {{LeaderLatch}} with the initial leadership, which deletes the corresponding ZNode, is in some kind of race condition with the watcher that triggers {{reset()}} for the left-over {{LeaderLatch}} instances instead of retrieving the left-over children:
> # The {{reset()}} triggers [getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629] through the [LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525] after a new child is created (I would expect the {{create2}} entry to appear in the logs before the {{getChildren}} entry, which is not the case; so I might be wrong in my observation).
> # The callback of {{getChildren}} triggers [checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
> # In the meantime, the predecessor gets deleted (I'd assume because of the deterministic ordering of the events in ZK). This causes the [callback in checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607] to fail with a {{NONODE}} event and to trigger the reset of the current {{LeaderLatch}} instance, which again triggers the deletion of the current {{LeaderLatch}}'s child ZNode and which is executed on the server later on.
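For context, here is a minimal, self-contained sketch of the kind of setup the quoted issue describes: several {{LeaderLatch}} instances competing on the same latch path, with the winning latch being closed as soon as it becomes leader. This is not the Flink test itself; the class name {{LeaderLatchRaceRepro}}, the connect string, the contender ids and the wait time are made up for illustration, and only the latch path {{/flink/default/latch}} is taken from the server logs above.

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchRaceRepro {

    public static void main(String[] args) throws Exception {
        // Hypothetical connect string; adjust to the ZooKeeper ensemble under test.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Three competing latches, mirroring the three drivers in the Flink test.
        for (int i = 0; i < 3; i++) {
            final LeaderLatch latch =
                    new LeaderLatch(client, "/flink/default/latch", "contender-" + i);
            latch.addListener(new LeaderLatchListener() {
                @Override
                public void isLeader() {
                    // As in the scenario above: the latch that wins leadership is closed,
                    // which deletes its latch ZNode and gives up the leadership.
                    try {
                        latch.close();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }

                @Override
                public void notLeader() {
                    // No-op; losing the leadership is expected for the closed latch.
                }
            });
            latch.start();
        }

        // Leave the remaining latches running long enough to observe the behavior
        // in the ZooKeeper request log before shutting the client down.
        Thread.sleep(60_000);
        client.close();
    }
}
{code}

Under the race described in the steps above, the two remaining latches would be expected to keep repeating the delete/create2/getData pattern visible in the ZooKeeper request log instead of one of them settling as the new leader.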