[ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558357#comment-17558357 ]
Matthias Pohl commented on FLINK-28078:
---------------------------------------

The loop consists of the following logs:
{code}
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}

# The {{reset()}} triggers [getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629] through [LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525] after a new child is created (I would expect the {{create2}} entry to appear in the logs before the {{getChildren}} entry, which is not the case; so, I might be wrong in my observation).
# The callback of {{getChildren}} triggers [checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
# In the meantime, the predecessor gets deleted (I'd assume because of the deterministic ordering of the events in ZK). This causes the [callback in checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607] to fail with a {{NONODE}} event, which triggers the reset of the current {{LeaderLatch}} instance. That reset in turn triggers the deletion of the current {{LeaderLatch}}'s child zNode, which is executed on the server later on.

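
To make steps 2 and 3 easier to follow, here is a minimal sketch of the decision as I understand it. This is a toy model and not Curator code: the class {{LatchLoopSketch}}, the {{Action}} enum, the shortened node names, and the plain string sort are all assumptions made for illustration. The point it shows is that the leadership check works on a possibly stale {{getChildren}} snapshot, so a predecessor that was deleted in the meantime leads to another reset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the leadership check described above; hypothetical, not Curator code. */
public class LatchLoopSketch {

    public enum Action { BECOME_LEADER, WATCH_PREDECESSOR, RESET }

    /**
     * @param snapshot  child names returned by an earlier getChildren call (may be stale)
     * @param ourNode   the child zNode owned by this latch
     * @param liveNodes child zNodes that still exist when the getData request is processed
     */
    public static Action checkLeadership(List<String> snapshot, String ourNode, Set<String> liveNodes) {
        List<String> sorted = new ArrayList<>(snapshot);
        // Assumption: the illustrative names sort correctly as plain strings.
        Collections.sort(sorted);
        int ourIndex = sorted.indexOf(ourNode);
        if (ourIndex < 0) {
            return Action.RESET; // our own node is missing from the snapshot
        }
        if (ourIndex == 0) {
            return Action.BECOME_LEADER; // lowest sequence number wins
        }
        String predecessor = sorted.get(ourIndex - 1);
        // A getData call on a deleted predecessor corresponds to the NONODE result in step 3.
        return liveNodes.contains(predecessor) ? Action.WATCH_PREDECESSOR : Action.RESET;
    }

    public static void main(String[] args) {
        List<String> snapshot = Arrays.asList("latch-0000000000", "latch-0000000001");

        // The predecessor was deleted after the snapshot was taken.
        Set<String> afterDelete = new HashSet<>(Arrays.asList("latch-0000000001"));
        System.out.println(checkLeadership(snapshot, "latch-0000000001", afterDelete));

        // The predecessor still exists, so the latch would simply watch it.
        Set<String> unchanged = new HashSet<>(snapshot);
        System.out.println(checkLeadership(snapshot, "latch-0000000001", unchanged));
    }
}
```

In this model the first call yields {{RESET}} and the second yields {{WATCH_PREDECESSOR}}. If the reset itself deletes and recreates the latch's child zNode, any other latch holding an older snapshot can run into the same {{RESET}} branch, which would match the loop seen in the logs.
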
> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28078
> URL: https://issues.apache.org/jira/browse/FLINK-28078
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Major
> Labels: test-stability
>
> [Build #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
> got stuck in {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10 java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 - parking to wait for <0x00000000c2571b80> (a java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 at org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.7#820007)