[ 
https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558357#comment-17558357
 ] 

Matthias Pohl commented on FLINK-28078:
---------------------------------------

The loop shows up in the logs as the following sequence of entries:
{code}
16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}
# The {{reset()}} triggers 
[getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629]
 through the 
[LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525]
 after a new child is created. (I would expect the {{create2}} entry to appear 
before the {{getChildren}} entry in the logs, which is not the case, so I 
might be wrong in this observation.)
# The callback of {{getChildren}} triggers 
[checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
# In the meantime, the predecessor gets deleted (I'd assume because of the 
deterministic ordering of the events in ZK). This causes the [callback in 
checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607]
 to fail with a {{NONODE}} event, which triggers a reset of the current 
{{LeaderLatch}} instance. That reset in turn deletes the current 
{{LeaderLatch}}'s child zNode; the deletion is executed on the server later on.
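
The three steps above can be replayed in a small toy model. This is only a sketch under assumptions and not Curator's code: {{ToyServer}}, {{ToyLatch}} and {{simulate}} are hypothetical names, the reset is assumed to be "delete own child, create a new child, list children", and two contenders are assumed to reset in an interleaved fashion so that each one's predecessor disappears before the watch on it is registered. Under these assumptions no contender ever settles, which would match a test that never completes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class LatchResetLoopSketch {

    /** Toy stand-in for the children of the latch znode (not a ZooKeeper API). */
    static final class ToyServer {
        private final TreeSet<String> children = new TreeSet<>();
        private int nextSequence = 0;

        String create(String owner) {
            // Zero-padded sequence number so that string order equals creation order.
            String node = String.format("%010d-%s", nextSequence++, owner);
            children.add(node);
            return node;
        }

        void delete(String node) {
            children.remove(node);
        }

        List<String> getChildren() {
            return new ArrayList<>(children);
        }

        boolean exists(String node) {
            return children.contains(node);
        }
    }

    /** Toy contender: owns one sequential child and remembers its last children snapshot. */
    static final class ToyLatch {
        private final String owner;
        private String ownNode;
        private List<String> snapshot = new ArrayList<>();

        ToyLatch(String owner) {
            this.owner = owner;
        }

        // Assumed reset order: delete the own child, create a new one, list the children.
        void reset(ToyServer server) {
            if (ownNode != null) {
                server.delete(ownNode);
            }
            ownNode = server.create(owner);
            snapshot = server.getChildren();
        }

        // Decides on the (possibly stale) snapshot. Returns true if the latch settles
        // (it is first, or its predecessor still exists and can be watched) and
        // false if the predecessor is gone, which models the NONODE result.
        boolean checkLeadership(ToyServer server) {
            int index = snapshot.indexOf(ownNode);
            if (index < 0) {
                return false;
            }
            if (index == 0) {
                return true;
            }
            return server.exists(snapshot.get(index - 1));
        }
    }

    /** Returns how many rounds ended with a NONODE-style failure, i.e. another reset. */
    static int simulate(int rounds, boolean interleaveResets) {
        ToyServer server = new ToyServer();
        ToyLatch current = new ToyLatch("A");
        ToyLatch other = new ToyLatch("B");
        other.reset(server);   // B creates the first child
        current.reset(server); // A creates the second child and sees B as predecessor

        int resetsTriggered = 0;
        for (int i = 0; i < rounds; i++) {
            if (interleaveResets) {
                // The predecessor is replaced before the watch on it is registered.
                other.reset(server);
            }
            if (current.checkLeadership(server)) {
                break;
            }
            resetsTriggered++;
            // The failed contender resets in the next round; the roles swap.
            ToyLatch tmp = current;
            current = other;
            other = tmp;
        }
        return resetsTriggered;
    }

    public static void main(String[] args) {
        System.out.println("interleaved resets:     " + simulate(5, true));
        System.out.println("no interleaved resets:  " + simulate(5, false));
    }
}
```

In this model every interleaved round acts on a snapshot that is already stale, so the check fails and schedules the next reset. Without the interleaving, the first check finds its predecessor and the contender settles immediately.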

> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers
>  runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: test-stability
>
> [Build 
> #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
>  got stuck in 
> {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 
> tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10       at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10       - parking to wait for  <0x00000000c2571b80> (a 
> java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10       at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10       at 
> org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> May 30 16:36:10       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10       at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
