[ 
https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228330#comment-17228330
 ] 

Yang Wang edited comment on FLINK-20045 at 11/9/20, 3:12 AM:
-------------------------------------------------------------

I think the root cause is that leadership was granted so quickly that 
{{TestingLeaderElectionEventHandler#init()}} had not been called yet. That is 
why the following exception shows up in the Maven log. This cannot happen in 
production code because we have a {{lock}} in {{DefaultLeaderElectionService}}.
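
To illustrate why a lock closes this window, here is a minimal sketch. It is not the actual {{DefaultLeaderElectionService}} source: the class {{LockedServiceSketch}}, its fields and the simulated driver thread are hypothetical names chosen for illustration. The assumption is only that startup and the grant callback synchronize on the same lock, so a callback fired early must wait until startup has finished.

```java
/** Hypothetical sketch: startup and callbacks share one lock, so callbacks never see a half-started service. */
class LockedServiceSketch {
    private final Object lock = new Object();
    private boolean running; // guarded by lock
    private boolean leader;  // guarded by lock
    private Thread driverThread;

    /** Starts the service; the simulated driver grants leadership immediately from its own thread. */
    void start() {
        synchronized (lock) {
            driverThread = new Thread(this::onGrantLeadership);
            driverThread.start();
            try {
                Thread.sleep(50); // widen the race window on purpose
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            running = true; // set before the lock is released
        }
    }

    /** Callback invoked by the simulated driver; blocks until start() has released the lock. */
    void onGrantLeadership() {
        synchronized (lock) {
            if (!running) {
                throw new IllegalStateException("service has not been started");
            }
            leader = true;
        }
    }

    /** Waits for the simulated driver thread to finish and reports whether leadership was recorded. */
    boolean awaitLeader() {
        try {
            driverThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (lock) {
            return leader;
        }
    }
}
```

The testing handler has no such lock, which is why its callback can run before {{init()}}.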

 

How can we fix the unstable tests?

I suggest adding a "wait-with-timeout" to 
{{TestingLeaderElectionEventHandler#onGrantLeadership}}, {{#onRevokeLeadership}} 
and {{#onLeaderInformationChange}}, so that there is enough time to create the 
{{LeaderElectionDriver}} and then {{init}} the 
{{TestingLeaderElectionEventHandler}}.
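
A rough sketch of what such a "wait-with-timeout" could look like, assuming a {{CountDownLatch}} that {{init()}} releases. The classes {{WaitingEventHandler}} and {{WaitWithTimeoutSketch}} are hypothetical stand-ins, not the real {{TestingLeaderElectionEventHandler}}, and the timeout values are arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical test handler whose callbacks wait for init() instead of failing immediately. */
class WaitingEventHandler {
    private final CountDownLatch initialized = new CountDownLatch(1);
    private final AtomicBoolean leader = new AtomicBoolean(false);
    private final long timeoutMillis;

    WaitingEventHandler(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by the test once the driver has been created. */
    void init() {
        initialized.countDown();
    }

    /** Called from the driver's callback thread, possibly before init(). */
    void onGrantLeadership() throws InterruptedException, TimeoutException {
        waitForInit();
        leader.set(true);
    }

    void onRevokeLeadership() throws InterruptedException, TimeoutException {
        waitForInit();
        leader.set(false);
    }

    boolean isLeader() {
        return leader.get();
    }

    private void waitForInit() throws InterruptedException, TimeoutException {
        if (!initialized.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("init() was not called within " + timeoutMillis + " ms");
        }
    }
}

class WaitWithTimeoutSketch {
    /** Simulates the race: the grant callback fires on another thread before init() is called. */
    static boolean grantBeforeInit(long timeoutMillis, long initDelayMillis) {
        WaitingEventHandler handler = new WaitingEventHandler(timeoutMillis);
        Thread callback = new Thread(() -> {
            try {
                handler.onGrantLeadership();
            } catch (InterruptedException | TimeoutException e) {
                // The wait failed, so leadership is never recorded.
            }
        });
        callback.start();
        try {
            Thread.sleep(initDelayMillis); // stands in for the time spent creating the driver
            handler.init();
            callback.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handler.isLeader();
    }

    public static void main(String[] args) {
        System.out.println(grantBeforeInit(5_000, 50));
    }
}
```

With a generous timeout the early callback simply waits until {{init()}} has run; with a timeout shorter than the initialization delay it still fails, but with an explicit timeout instead of the immediate "init() should be called first" error.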

 
{code:java}
10:30:37,419 [Curator-LeaderLatch-0] WARN  org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,419 [Curator-LeaderLatch-0] WARN  org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,468 [    main-EventThread] ERROR org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer [] - Listener (ZooKeeperLeaderElectionDriver{leaderPath='/leader'}) threw an exception
org.apache.flink.util.FlinkRuntimeException: init() should be called first.
	at org.apache.flink.runtime.leaderelection.TestingLeaderElectionEventHandler.onGrantLeadership(TestingLeaderElectionEventHandler.java:46) ~[test-classes/:?]
	at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:158) ~[classes/:?]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
10:33:58,379 [                main] INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader'}
10:33:58,383 [                main] INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader'}.
10:33:58,385 [ Curator-Framework-0] INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
10:33:58,388 [                main] INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x102de7dc7fc0000 closed
10:33:58,389 [    main-EventThread] INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x102de7dc7fc0000
10:33:58,392 [                main] ERROR org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest [] - 
{code}


was (Author: fly_in_gis):
I think the root cause is that leadership was granted so quickly that 
{{TestingLeaderElectionEventHandler#init()}} had not been called yet. That is 
why the following exception shows up in the Maven log. This cannot happen in 
production code because we have a {{lock}} in {{DefaultLeaderElectionService}}.

 

How can we fix the unstable tests?

I suggest adding a "wait-with-timeout" to the 
{{TestingLeaderElectionEventHandler}}, so that there is enough time to create 
the {{LeaderElectionDriver}} and then {{init}} the 
{{TestingLeaderElectionEventHandler}}.

 

> ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with 
> "TimeoutException: Contender was not elected as the leader within 200000ms"
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20045
>                 URL: https://issues.apache.org/jira/browse/FLINK-20045
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Dian Fu
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9251&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> 2020-11-07T10:34:07.5063203Z [ERROR] testZooKeeperLeaderElectionRetrieval(org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest)  Time elapsed: 202.445 s  <<< ERROR!
> 2020-11-07T10:34:07.5064331Z java.util.concurrent.TimeoutException: Contender was not elected as the leader within 200000ms
> 2020-11-07T10:34:07.5064946Z  at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:153)
> 2020-11-07T10:34:07.5065762Z  at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:139)
> 2020-11-07T10:34:07.5066565Z  at org.apache.flink.runtime.leaderelection.TestingLeaderBase.waitForLeader(TestingLeaderBase.java:48)
> 2020-11-07T10:34:07.5067185Z  at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval(ZooKeeperLeaderElectionTest.java:144)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
