[ 
https://issues.apache.org/jira/browse/CURATOR-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566107#comment-17566107
 ] 

Zili Chen edited comment on CURATOR-645 at 7/13/22 2:29 AM:
------------------------------------------------------------

I've pushed a patch at [https://github.com/apache/curator/pull/430]. It should 
fix this issue.


was (Author: tison):
I push a patch at [https://github.com/apache/curator/pull/430.] It should fix 
this issue.

> LeaderLatch generates infinite loop with two LeaderLatch instances competing 
> for the leadership
> -----------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-645
>                 URL: https://issues.apache.org/jira/browse/CURATOR-645
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.2.0
>            Reporter: Matthias Pohl
>            Priority: Major
>
> We experienced strange behavior of the LeaderLatch in a test case in Apache 
> Flink (see FLINK-28078) where two {{LeaderLatch}} instances compete for the 
> leadership, resulting in an infinite loop.
> The test includes three instances of a wrapper class that has a 
> {{LeaderLatch}} as a member. This is about 
> [ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperMultipleComponentLeaderElectionDriverTest.java#L236].
>  In the test, the first {{LeaderLatch}} acquires the leadership, which 
> results in that {{LeaderLatch}} being closed and, as a consequence, losing 
> the leadership. The odd thing is that the two left-over {{LeaderLatch}} 
> instances end up in an infinite loop, as shown in the ZooKeeper server logs 
> (a minimal reproduction sketch follows the log excerpt):
> {code:java}
> 16:17:07,864 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
> request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 
> zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
> 16:17:07,864 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - 
> sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 
> zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
> 16:17:07,866 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
> request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc 
> txntype:2 reqpath:n/a
> 16:17:07,866 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - 
> sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 
> reqpath:n/a
> 16:17:07,869 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
> request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd 
> txntype:15 reqpath:n/a
> 16:17:07,869 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - 
> sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 
> reqpath:n/a
> 16:17:07,869 [        SyncThread:0] DEBUG 
> org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
> request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 
> zxid:0xfffffffffffffffe txntype:unknown 
> reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
> {code}
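> For illustration, the scenario boils down to something like the following 
> minimal sketch (hypothetical reproduction code, not the actual Flink test 
> linked above; the connect string, class name, and latch path are assumptions):
> {code:java}
> // Hypothetical reproduction sketch: three LeaderLatch instances on the same
> // latch path; the first one to become leader is closed, mimicking the test.
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.leader.LeaderLatch;
> import org.apache.curator.retry.ExponentialBackoffRetry;
> 
> import java.util.ArrayList;
> import java.util.List;
> 
> public class CompetingLatchesRepro {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "localhost:2181", new ExponentialBackoffRetry(1000, 3));
>         client.start();
> 
>         // Start three latches competing on the same latch path.
>         List<LeaderLatch> latches = new ArrayList<>();
>         for (int i = 0; i < 3; i++) {
>             LeaderLatch latch = new LeaderLatch(client, "/flink/default/latch");
>             latch.start();
>             latches.add(latch);
>         }
> 
>         // Wait until one of the latches acquires leadership, then close it.
>         LeaderLatch leader = null;
>         while (leader == null) {
>             for (LeaderLatch latch : latches) {
>                 if (latch.hasLeadership()) {
>                     leader = latch;
>                 }
>             }
>             Thread.sleep(10);
>         }
>         leader.close();
> 
>         // The two remaining latches now re-run reset()/checkLeadership();
>         // with the race described below they keep deleting and recreating
>         // their latch nodes, matching the loop visible in the server logs.
>     }
> }
> {code}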
> It looks like there is some kind of race between the close call of the 
> {{LeaderLatch}} that held the initial leadership (which deletes the 
> corresponding ZNode) and the watcher that triggers {{reset()}} for the 
> left-over {{LeaderLatch}} instances instead of just retrieving the remaining 
> children:
>  # The {{reset()}} triggers 
> [getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629]
>  through 
> [LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525]
>  after a new child is created (I would expect the {{create2}} entry in the 
> logs to appear before the {{getChildren}} entry, which is not the case; so I 
> might be wrong in this observation).
>  # The callback of {{getChildren}} triggers 
> [checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
>  # In the meantime, the predecessor gets deleted (I'd assume because of the 
> deterministic ordering of the events in ZK). This causes the [callback in 
> checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607]
>  to fail with a {{NONODE}} event, which triggers a reset of the current 
> {{LeaderLatch}} instance; that reset in turn deletes the current 
> {{LeaderLatch}}'s child zNode, and the deletion is executed on the server 
> later on (see the sketch after this list).
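> To make the cycle easier to follow, here is a much-simplified, hypothetical 
> sketch of the reset/getChildren/checkLeadership callback chain (illustrative 
> structure only, not the actual {{LeaderLatch}} code; the class and field 
> names are made up):
> {code:java}
> // Much-simplified, hypothetical sketch of the callback chain in steps 1-3;
> // this is NOT the actual LeaderLatch implementation.
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.utils.ZKPaths;
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.KeeperException;
> 
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> 
> public class LatchCycleSketch {
>     private final CuratorFramework client;
>     private final String latchPath;       // e.g. "/flink/default/latch"
>     private volatile String ourNodeName;  // name of our ephemeral-sequential node
> 
>     public LatchCycleSketch(CuratorFramework client, String latchPath) {
>         this.client = client;
>         this.latchPath = latchPath;
>     }
> 
>     // reset(): (re)create our latch node; the background callback then kicks
>     // off getChildren() -- step 1 in the list above.
>     public void reset() throws Exception {
>         client.create()
>                 .creatingParentContainersIfNeeded()
>                 .withProtection()
>                 .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
>                 .inBackground((c, event) -> {
>                     ourNodeName = ZKPaths.getNodeFromPath(event.getName());
>                     getChildren();
>                 })
>                 .forPath(ZKPaths.makePath(latchPath, "latch-"));
>     }
> 
>     // Step 1: fetch the current children of the latch path.
>     private void getChildren() throws Exception {
>         client.getChildren()
>                 .inBackground((c, event) -> checkLeadership(event.getChildren()))
>                 .forPath(latchPath);
>     }
> 
>     // Step 2: determine our predecessor and check it via getData().
>     private void checkLeadership(List<String> children) throws Exception {
>         List<String> sorted = new ArrayList<>(children);
>         Collections.sort(sorted); // the real code sorts by the sequence suffix
>         int ourIndex = sorted.indexOf(ourNodeName);
>         if (ourIndex <= 0) {
>             return; // leader (or our node is gone): nothing to watch
>         }
>         String predecessor = sorted.get(ourIndex - 1);
>         client.getData()
>                 .inBackground((c, event) -> {
>                     if (event.getResultCode()
>                             == KeeperException.Code.NONODE.intValue()) {
>                         // Step 3: the predecessor was deleted in the meantime
>                         // (here by the closing leader). Resetting, instead of
>                         // just re-reading the children, deletes and recreates
>                         // our own node; the other surviving latch reacts the
>                         // same way, producing the loop seen in the logs.
>                         reset();
>                     }
>                 })
>                 .forPath(ZKPaths.makePath(latchPath, predecessor));
>     }
> }
> {code}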


