[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228928#comment-17228928 ]

Dian Fu commented on FLINK-20045:

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9361&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=fd9796c3-9ce8-5619-781c-42f873e126a6

> ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
>
> Key: FLINK-20045
> URL: https://issues.apache.org/jira/browse/FLINK-20045
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.12.0
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9251&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> 2020-11-07T10:34:07.5063203Z [ERROR] testZooKeeperLeaderElectionRetrieval(org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest) Time elapsed: 202.445 s <<< ERROR!
> 2020-11-07T10:34:07.5064331Z java.util.concurrent.TimeoutException: Contender was not elected as the leader within 200000ms
> 2020-11-07T10:34:07.5064946Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:153)
> 2020-11-07T10:34:07.5065762Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:139)
> 2020-11-07T10:34:07.5066565Z     at org.apache.flink.runtime.leaderelection.TestingLeaderBase.waitForLeader(TestingLeaderBase.java:48)
> 2020-11-07T10:34:07.5067185Z     at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval(ZooKeeperLeaderElectionTest.java:144)
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228646#comment-17228646 ]

Yang Wang commented on FLINK-20045:

Makes sense to me.
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228585#comment-17228585 ]

Till Rohrmann commented on FLINK-20045:

Maybe it is also fine to simply block {{onGrantLeadership}} and {{revokeLeadership}} until {{init}} has been called.
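Till Rohrmann's suggestion of blocking the callbacks until {{init}} has been called could look roughly like the sketch below. It is a minimal illustration, not Flink's actual test code: {{BlockingEventHandlerSketch}} and {{BlockingEventHandlerDemo}} are hypothetical stand-ins, and a plain {{CountDownLatch}} is assumed as the gate.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a test event handler; not Flink's actual class. */
class BlockingEventHandlerSketch {
    // Opened exactly once by init(); callbacks wait on it until then.
    private final CountDownLatch initialized = new CountDownLatch(1);
    private volatile boolean leader;

    void init() {
        initialized.countDown();
    }

    void onGrantLeadership() {
        awaitInit();
        leader = true;
    }

    void onRevokeLeadership() {
        awaitInit();
        leader = false;
    }

    boolean isLeader() {
        return leader;
    }

    private void awaitInit() {
        try {
            initialized.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for init()", e);
        }
    }
}

/** Simulates a driver callback thread that may deliver the grant before init() runs. */
class BlockingEventHandlerDemo {
    static boolean grantBeforeInitIsNotLost() {
        BlockingEventHandlerSketch handler = new BlockingEventHandlerSketch();
        CountDownLatch callbackStarted = new CountDownLatch(1);
        Thread callbackThread = new Thread(() -> {
            callbackStarted.countDown();
            handler.onGrantLeadership(); // blocks unless init() has already run
        }, "driver-callback");
        callbackThread.start();
        try {
            callbackStarted.await(); // the callback thread is running before init()
            handler.init();
            callbackThread.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return handler.isLeader();
    }
}
```

With this shape an early grant is delayed rather than rejected with "init() should be called first.", so the test no longer depends on how quickly the election completes.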
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228543#comment-17228543 ]

Yang Wang commented on FLINK-20045:

I agree that {{TestingLeaderElectionEventHandler}} should not be responsible for writing the leader information back to external storage (e.g. ZooKeeper, K8s ConfigMap). In some driver tests we would then need to call {{LeaderElectionDriver#writeLeaderInformation}} manually so that the {{LeaderRetrievalDriver}} can get the leader information. Does that make sense to you? Or do you still think we should use the {{DefaultLeaderElectionService}}, which writes the leader information back to external storage automatically?
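The manual write that Yang Wang describes can be pictured with the sketch below. All types here ({{ExternalStoreSketch}}, {{ElectionDriverSketch}}, {{RetrievalDriverSketch}}) are hypothetical in-memory stand-ins and do not mirror the real Flink signatures; the address string is a placeholder.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for external storage such as ZooKeeper or a ConfigMap. */
class ExternalStoreSketch {
    final AtomicReference<String> leaderAddress = new AtomicReference<>();
}

/** Hypothetical election driver: it only writes when explicitly asked to. */
class ElectionDriverSketch {
    private final ExternalStoreSketch store;

    ElectionDriverSketch(ExternalStoreSketch store) {
        this.store = store;
    }

    void writeLeaderInformation(String address) {
        store.leaderAddress.set(address);
    }
}

/** Hypothetical retrieval driver reading from the same store. */
class RetrievalDriverSketch {
    private final ExternalStoreSketch store;

    RetrievalDriverSketch(ExternalStoreSketch store) {
        this.store = store;
    }

    String retrieveLeaderAddress() {
        return store.leaderAddress.get();
    }
}

class ManualWriteDemo {
    static String run() {
        ExternalStoreSketch store = new ExternalStoreSketch();
        ElectionDriverSketch election = new ElectionDriverSketch(store);
        RetrievalDriverSketch retrieval = new RetrievalDriverSketch(store);
        // The event handler publishes nothing here, so the test writes explicitly.
        election.writeLeaderInformation("contender-address");
        return retrieval.retrieveLeaderAddress();
    }
}
```

The trade-off is that every driver test needing retrieval has to remember the explicit write, which is the work {{DefaultLeaderElectionService}} would otherwise do.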
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228477#comment-17228477 ]

Till Rohrmann commented on FLINK-20045:

I think the underlying problem is that the {{TestingLeaderElectionEventHandler}} tries to do too much: concretely, it writes the leader information via the {{LeaderElectionDriver}}. I see two ways to solve the problem. Either we use the {{DefaultLeaderElectionService}}, whose task it is to write the leader information back, or we introduce a {{start}} method that we can use to start the {{LeaderElectionDriver}}. That way, we could initialize the {{TestingLeaderElectionEventHandler}} before starting the driver.
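The second option, a separate {{start}} method on the driver, can be sketched as follows. The classes are hypothetical ({{StartableDriverSketch}}, {{RecordingHandlerSketch}}), and the assumption that {{init}} takes the driver as its argument is made only for illustration.

```java
/** Hypothetical callback interface; not Flink's actual handler interface. */
interface EventHandlerSketch {
    void onGrantLeadership();
}

/** Hypothetical driver whose constructor has no side effects. */
class StartableDriverSketch {
    private final EventHandlerSketch handler;

    StartableDriverSketch(EventHandlerSketch handler) {
        this.handler = handler;
    }

    // Callbacks can only fire from here on. The election is simulated as
    // completing immediately, which is the worst case for the original race.
    void start() {
        handler.onGrantLeadership();
    }
}

/** Hypothetical handler that rejects callbacks arriving before init(). */
class RecordingHandlerSketch implements EventHandlerSketch {
    private StartableDriverSketch driver;
    private boolean leader;

    void init(StartableDriverSketch driver) {
        this.driver = driver;
    }

    @Override
    public void onGrantLeadership() {
        if (driver == null) {
            throw new IllegalStateException("init() should be called first.");
        }
        leader = true;
    }

    boolean isLeader() {
        return leader;
    }
}

class StartOrderDemo {
    static boolean run() {
        RecordingHandlerSketch handler = new RecordingHandlerSketch();
        StartableDriverSketch driver = new StartableDriverSketch(handler);
        handler.init(driver); // safe: the driver exists but is still idle
        driver.start();       // election callbacks may fire only from here on
        return handler.isLeader();
    }
}
```

Because construction and start-up are separated, the ordering is fixed by the test code itself and needs no waiting at all.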
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228448#comment-17228448 ]

Till Rohrmann commented on FLINK-20045:

I think introducing sleeps is not a reliable way of fixing this problem.
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228330#comment-17228330 ]

Yang Wang commented on FLINK-20045:

I think the root cause is that leadership was granted before {{TestingLeaderElectionEventHandler#init()}} had been called, which is why the following exception shows up in the Maven log. This cannot happen in production code because we have a {{lock}} in {{DefaultLeaderElectionService}}.

How should we fix the unstable test? I suggest adding a "wait-with-timeout" to the {{TestingLeaderElectionEventHandler}} so that there is enough time to create the {{LeaderElectionDriver}} and then {{init}} the {{TestingLeaderElectionEventHandler}}.

{code:java}
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,468 [ main-EventThread] ERROR org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer [] - Listener (ZooKeeperLeaderElectionDriver{leaderPath='/leader'}) threw an exception
org.apache.flink.util.FlinkRuntimeException: init() should be called first.
    at org.apache.flink.runtime.leaderelection.TestingLeaderElectionEventHandler.onGrantLeadership(TestingLeaderElectionEventHandler.java:46) ~[test-classes/:?]
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:158) ~[classes/:?]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
10:33:58,379 [ main] INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader'}
10:33:58,383 [ main] INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing Zookeeper
[jira] [Commented] (FLINK-20045) ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with "TimeoutException: Contender was not elected as the leader within 200000ms"
[ https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227925#comment-17227925 ]

Yang Wang commented on FLINK-20045:

I am taking a look at this failed test.