[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228330#comment-17228330 ]
Yang Wang edited comment on FLINK-20045 at 11/9/20, 3:12 AM:
-------------------------------------------------------------

I think the root cause is that leadership was granted before {{TestingLeaderElectionEventHandler#init()}} had been called, which is why the following exception shows up in the Maven log. This cannot happen in production code, since we have a {{lock}} in {{DefaultLeaderElectionService}}.

How to fix the unstable tests?

I suggest adding a "wait-with-timeout" to {{TestingLeaderElectionEventHandler#onGrantLeadership}}, {{#onRevokeLeadership}}, and {{#onLeaderInformationChange}} so that there is enough time to create the {{LeaderElectionDriver}} and then {{init}} the {{TestingLeaderElectionEventHandler}}.

{code:java}
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,468 [ main-EventThread] ERROR org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer [] - Listener (ZooKeeperLeaderElectionDriver{leaderPath='/leader'}) threw an exception
org.apache.flink.util.FlinkRuntimeException: init() should be called first.
	at org.apache.flink.runtime.leaderelection.TestingLeaderElectionEventHandler.onGrantLeadership(TestingLeaderElectionEventHandler.java:46) ~[test-classes/:?]
	at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:158) ~[classes/:?]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
10:33:58,379 [ main] INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader'}
10:33:58,383 [ main] INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader'}.
10:33:58,385 [ Curator-Framework-0] INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
10:33:58,388 [ main] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x102de7dc7fc0000 closed
10:33:58,389 [ main-EventThread] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x102de7dc7fc0000
10:33:58,392 [ main] ERROR org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest [] - 
{code}

> ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-20045
> URL: https://issues.apache.org/jira/browse/FLINK-20045
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9251&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> 2020-11-07T10:34:07.5063203Z [ERROR] testZooKeeperLeaderElectionRetrieval(org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest) Time elapsed: 202.445 s <<< ERROR!
> 2020-11-07T10:34:07.5064331Z java.util.concurrent.TimeoutException: Contender was not elected as the leader within 200000ms
> 2020-11-07T10:34:07.5064946Z 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:153)
> 2020-11-07T10:34:07.5065762Z 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:139)
> 2020-11-07T10:34:07.5066565Z 	at org.apache.flink.runtime.leaderelection.TestingLeaderBase.waitForLeader(TestingLeaderBase.java:48)
> 2020-11-07T10:34:07.5067185Z 	at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval(ZooKeeperLeaderElectionTest.java:144)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
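
For illustration only, the "wait-with-timeout" idea from the comment could look roughly like the sketch below. It is not the actual {{TestingLeaderElectionEventHandler}}: the class name {{WaitingEventHandlerSketch}}, the {{timeoutMillis}} parameter, and the use of a {{CountDownLatch}} are all assumptions made for this example. The callbacks block until {{init()}} has been called, and fail with an exception if that does not happen within the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a testing event handler; not the real Flink class. */
class WaitingEventHandlerSketch {

    // Released once init() has run, i.e. after the driver has been created.
    private final CountDownLatch initialized = new CountDownLatch(1);
    private final AtomicBoolean leader = new AtomicBoolean(false);
    private final long timeoutMillis; // assumed parameter, chosen by the test

    WaitingEventHandlerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by the test once the driver exists. */
    void init() {
        initialized.countDown();
    }

    /** May be invoked from a driver callback thread before init() has run. */
    void onGrantLeadership() {
        waitForInit();
        leader.set(true);
    }

    void onRevokeLeadership() {
        waitForInit();
        leader.set(false);
    }

    boolean isLeader() {
        return leader.get();
    }

    // Blocks until init() was called; fails if the timeout elapses first.
    private void waitForInit() {
        try {
            if (!initialized.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                        "init() was not called within " + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for init()", e);
        }
    }
}
```

With such a handler, a grant that arrives early simply waits on the callback thread until the test calls {{init()}}, instead of throwing "init() should be called first." right away. The trade-off is that the callback thread is blocked for up to the timeout.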