[ https://issues.apache.org/jira/browse/FLINK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931030#comment-16931030 ]
Yun Tang commented on FLINK-14091:
----------------------------------

From the error description, I think {{SharedCountConnectionStateListener}} should also handle the connection state after {{RECONNECTED}}. CC [~uce]

> Job can not trigger checkpoint forever after zookeeper change leader
> ---------------------------------------------------------------------
>
>                 Key: FLINK-14091
>                 URL: https://issues.apache.org/jira/browse/FLINK-14091
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Peng Wang
>            Priority: Minor
>
> When the ZooKeeper leader changes, the Curator connection state becomes SUSPENDED and the JobManager cannot trigger checkpoints. However, it still does not trigger checkpoints after the ZooKeeper connection resumes.
> We found that lastState in the class ZooKeeperCheckpointIDCounter never changes back to normal once it has fallen into SUSPENDED or LOST.
>
>     /**
>      * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>      * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>      */
>     private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>         private volatile ConnectionState lastState;
>
>         @Override
>         public void stateChanged(CuratorFramework client, ConnectionState newState) {
>             if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>                 lastState = newState;
>             }
>         }
>
>         private ConnectionState getLastState() {
>             return lastState;
>         }
>     }
>
> We changed the listener so that it resets the state. After testing, this solved the problem.
>
>     /**
>      * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>      * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>      */
>     private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>         private volatile ConnectionState lastState;
>
>         @Override
>         public void stateChanged(CuratorFramework client, ConnectionState newState) {
>             if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>                 lastState = newState;
>             }
>             else {
>                 /* if connectionState is not SUSPENDED and LOST, reset lastState. */
>                 lastState = null;
>             }
>         }
>
>         private ConnectionState getLastState() {
>             return lastState;
>         }
>     }
>
> log:
>
> 2019-09-16 13:38:38,020 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:38,122 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2019-09-16 13:38:38,123 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/dispatcher no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/resourcemanager no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://node007224:8081 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/jobmanager_2 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:39,109 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,109 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.231/192.168.7.231:2181
> 2019-09-16 13:38:39,109 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,110 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.231/192.168.7.231:2181, initiating session
> 2019-09-16 13:38:39,112 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:39,778 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.230/192.168.7.230:2181
> 2019-09-16 13:38:39,778 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.230/192.168.7.230:2181, initiating session
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 192.168.7.230/192.168.7.230:2181, sessionid = 0x26cff6487c2000e, negotiated timeout = 60000
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:43,142 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 6995 for job 21b6ef566750f5766443641254e8e1a9 (16841 bytes in 49 ms).
> 2019-09-16 13:38:43,144 ERROR org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Exception while triggering checkpoint for job 21b6ef566750f5766443641254e8e1a9.
> java.lang.IllegalStateException: Connection state: SUSPENDED
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.checkConnectionState(ZooKeeperCheckpointIDCounter.java:159)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.get(ZooKeeperCheckpointIDCounter.java:133)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:448)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:1323)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
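
For readers following the thread, the behaviour discussed above (a listener that remembers SUSPENDED or LOST and clears that marker once the connection recovers) can be reproduced without a ZooKeeper cluster. The following is only a minimal sketch under stated assumptions: the `ConnectionState` enum is a hypothetical local stand-in for a subset of Curator's enum of the same name, `SharedCountStateTracker` is a hypothetical simplification of the listener quoted in the issue (it does not implement Curator's `ConnectionStateListener`), and `checkConnectionState` only imitates the check that appears in the stack trace. It is not the actual Flink patch.

```java
public class ListenerSketch {

    /** Hypothetical stand-in for a subset of Curator's ConnectionState values. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical, simplified version of the listener quoted in the issue. */
    static class SharedCountStateTracker {

        // null means "no known connection problem"
        private volatile ConnectionState lastState;

        void stateChanged(ConnectionState newState) {
            if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                // Remember the problematic state so that reads can be rejected.
                lastState = newState;
            } else {
                // Any other state (e.g. RECONNECTED) clears the marker again.
                lastState = null;
            }
        }

        /** Imitates the check seen in the stack trace: fail while a bad state is recorded. */
        void checkConnectionState() {
            ConnectionState state = lastState;
            if (state != null) {
                throw new IllegalStateException("Connection state: " + state);
            }
        }
    }

    public static void main(String[] args) {
        SharedCountStateTracker tracker = new SharedCountStateTracker();

        tracker.stateChanged(ConnectionState.SUSPENDED);
        try {
            tracker.checkConnectionState();
        } catch (IllegalStateException e) {
            System.out.println("While suspended: " + e.getMessage());
        }

        tracker.stateChanged(ConnectionState.RECONNECTED);
        tracker.checkConnectionState(); // no exception once the marker has been reset
        System.out.println("After reconnect: check passed");
    }
}
```

Without the `else` branch, the second `checkConnectionState()` call would keep throwing after `RECONNECTED`, which matches the symptom in the log above where checkpoint triggering fails with `Connection state: SUSPENDED` several seconds after the reconnect.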