[ 
https://issues.apache.org/jira/browse/FLINK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias closed FLINK-18480.
----------------------------
    Resolution: Won't Fix

I'm going to close this issue for now. [~AlberyZJG], feel free to reopen it if 
you want to discuss it further or if you're able to reproduce the problem with a 
newer version of Flink.

> JobManager lost zk connection, HA can not work
> ----------------------------------------------
>
>                 Key: FLINK-18480
>                 URL: https://issues.apache.org/jira/browse/FLINK-18480
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.2
>            Reporter: Zhang Jianguo
>            Priority: Major
>
> *Flink version 1.7.2*
> *At this point we hit a network exception and lost the connection to ZooKeeper.*
> 2020-06-29 21:05:39,540 | INFO  | [flink-akka.actor.default-dispatcher-601] | 
> Closing TaskExecutor connection container_e15_1590623517232_44463_01_000002 
> {color:#de350b}*because: ResourceManager leader changed to new address 
> null*{color} | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:822) 2020-06-29 21:05:39,540 | INFO  | 
> [flink-akka.actor.default-dispatcher-601] | Closing TaskExecutor connection 
> container_e15_1590623517232_44463_01_000002 because: ResourceManager leader 
> changed to new address null | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:822) 2020-06-29 21:05:39,541 | INFO  | 
> [flink-akka.actor.default-dispatcher-494] | Map -> Flat Map -> (Sink: 
> KafkaParquetSink, Filter -> Map) (12/44) (e8fe39f7a15cce3898ba60c1f422c735) 
> switched from RUNNING to FAILED. | 
> org.apache.flink.runtime.executiongraph.ExecutionGraph (Execution.java:1342) 
> java.lang.Exception: Job leader for job id d42515dde773214b841cef21d1549808 
> lost leadership. at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  
> *I filtered the log for changes of the ZooKeeper connection state.*
> 2020-06-29 21:05:39,609 | INFO | [Curator-ConnectionStateManager-0] | 
> *{color:#de350b}new state: SUSPENDED{color}* | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,611 | INFO | [Curator-ConnectionStateManager-0] | 
> {color:#de350b}*new state: SUSPENDED*{color} | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,615 | INFO | [Curator-ConnectionStateManager-0] | 
> *{color:#de350b}new state: SUSPENDED{color}* | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,616 | INFO | [Curator-ConnectionStateManager-0] | 
> {color:#de350b}*new state: SUSPENDED*{color} | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,710 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: LOST | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,725 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: LOST | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,731 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: LOST | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,736 | INFO | [Curator-ConnectionStateManager-0] | 
> {color:#de350b}*new state: LOST*{color} | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,933 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: RECONNECTED | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,941 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: RECONNECTED | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,941 | INFO | [Curator-ConnectionStateManager-0] | new 
> state: RECONNECTED | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530) 
> 2020-06-29 21:05:39,941 | INFO | [Curator-ConnectionStateManager-0] | 
> {color:#de350b}*new state: RECONNECTED*{color} | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:530)
>  
> *After reconnecting to ZooKeeper, Flink tried to recover all jobs.*
> 2020-06-29 21:05:39,953 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Initializing job (d42515dde773214b841cef21d1549808). | 
> org.apache.flink.runtime.jobmaster.JobMaster (JobMaster.java:271) 
> 2020-06-29 21:05:39,954 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> ResourceManager akka.ssl.tcp://flink@anawrk00002-3:32586/user/resourcemanager 
> was granted leadership with fencing token 909783bdcc87da86ee9c7af38ed44f3a | 
> org.apache.flink.yarn.YarnResourceManager (ResourceManager.java:955) 
> 2020-06-29 21:05:39,954 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Starting the SlotManager. | 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager 
> (SlotManager.java:198) 
> 2020-06-29 21:05:39,983 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000010 
> (akka.ssl.tcp://flink@anawrk00015-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,983 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000005 
> (akka.ssl.tcp://flink@anawrk00001-3:32327/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,988 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000009 
> (akka.ssl.tcp://flink@anawrk00011-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,988 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000007 
> (akka.ssl.tcp://flink@anawrk00010-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,990 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000004 
> (akka.ssl.tcp://flink@anawrk00006-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,991 | INFO | [flink-akka.actor.default-dispatcher-593] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000008 
> (akka.ssl.tcp://flink@anawrk00002-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,993 | INFO | [flink-akka.actor.default-dispatcher-601] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000002 
> (akka.ssl.tcp://flink@anawrk00001-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,993 | INFO | [flink-akka.actor.default-dispatcher-601] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000003 
> (akka.ssl.tcp://flink@anadat00002-3:32327/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,994 | INFO | [flink-akka.actor.default-dispatcher-601] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000011 
> (akka.ssl.tcp://flink@anawrk00014-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:39,994 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Using restart strategy NoRestartStrategy for 
> (d42515dde773214b841cef21d1549808). | 
> org.apache.flink.runtime.jobmaster.JobMaster (JobMaster.java:282) 
> 2020-06-29 21:05:39,995 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Starting RPC endpoint for 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool at 
> akka://flink/user/59b319a7-00aa-4bf1-b50d-bc7267fe2189 . | 
> org.apache.flink.runtime.rpc.akka.AkkaRpcService (AkkaRpcService.java:224) 
> 2020-06-29 21:05:40,000 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Job recovers via failover strategy: full graph restart | 
> org.apache.flink.runtime.executiongraph.ExecutionGraph 
> (ExecutionGraph.java:428) 
> 2020-06-29 21:05:40,001 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Running initialization on master for job (d42515dde773214b841cef21d1549808). 
> | org.apache.flink.runtime.jobmaster.JobMaster 
> (ExecutionGraphBuilder.java:195) 
> 2020-06-29 21:05:40,001 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Successfully ran initialization on master in 0 ms. | 
> org.apache.flink.runtime.jobmaster.JobMaster (ExecutionGraphBuilder.java:224) 
> 2020-06-29 21:05:40,008 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Initialized in '/checkpoints/d42515dde773214b841cef21d1549808'. | 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> (ZooKeeperCompletedCheckpointStore.java:137) 
> 2020-06-29 21:05:40,009 | WARN | [flink-akka.actor.default-dispatcher-494] | 
> Config uses deprecated configuration key 'jobmanager.web.checkpoints.history' 
> instead of proper key 'web.checkpoints.history' | 
> org.apache.flink.configuration.Configuration (Configuration.java:797) 
> 2020-06-29 21:05:40,009 | WARN | [flink-akka.actor.default-dispatcher-494] | 
> Config uses deprecated configuration key 'state.backend.fs.checkpointdir' 
> instead of proper key 'state.checkpoints.dir' | 
> org.apache.flink.configuration.Configuration (Configuration.java:797) 
> 2020-06-29 21:05:40,009 | WARN | [flink-akka.actor.default-dispatcher-494] | 
> Config uses deprecated configuration key 'state.backend.fs.checkpointdir' 
> instead of proper key 'state.checkpoints.dir' | 
> org.apache.flink.configuration.Configuration (Configuration.java:797) 
> 2020-06-29 21:05:40,009 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> No state backend has been configured, using default (Memory / JobManager) 
> MemoryStateBackend (data in heap memory / checkpoints to JobManager) 
> (checkpoints: 'hdfs:/tsp/flink/checkpoints', savepoints: 
> 'hdfs:/tsp/flink/savepoint', asynchronous: TRUE, maxStateSize: 5242880) | 
> org.apache.flink.runtime.jobmaster.JobMaster (StateBackendLoader.java:230) 
> 2020-06-29 21:05:40,035 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000006 
> (akka.ssl.tcp://flink@anadat00002-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730) 
> 2020-06-29 21:05:40,089 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Registering TaskManager with ResourceID 
> container_e15_1590623517232_44463_01_000012 
> (akka.ssl.tcp://flink@anawrk00005-3:32326/user/taskmanager_0) at 
> ResourceManager | org.apache.flink.yarn.YarnResourceManager 
> (ResourceManager.java:730)
> ...
> 2020-06-29 21:05:41,978 | INFO | [flink-akka.actor.default-dispatcher-601] | 
> Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId\{5f26e3fd030369472eb3b9225d97def8}] | 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool (SlotPool.java:753)
> ...
> 2020-06-29 21:05:41,997 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Connecting to ResourceManager 
> akka.ssl.tcp://flink@anawrk00002-3:32586/user/resourcemanager(909783bdcc87da86ee9c7af38ed44f3a)
>  | org.apache.flink.runtime.jobmaster.JobMaster (JobMaster.java:1307) 
> 2020-06-29 21:05:41,998 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Resolved ResourceManager address, beginning registration | 
> org.apache.flink.runtime.jobmaster.JobMaster (RetryingRegistration.java:201) 
> 2020-06-29 21:05:41,998 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Registration at ResourceManager attempt 1 (timeout=100ms) | 
> org.apache.flink.runtime.jobmaster.JobMaster (RetryingRegistration.java:250) 
> 2020-06-29 21:05:41,998 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Starting ZooKeeperLeaderRetrievalService 
> /leader/d42515dde773214b841cef21d1549808/job_manager_lock. | 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
> (ZooKeeperLeaderRetrievalService.java:100) 
> 2020-06-29 21:05:41,999 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Registering job manager 
> bff6deab98cf97541807aefbd57a4...@akka.ssl.tcp://flink@anawrk00002-3:32586/user/jobmanager_1
>  for job d42515dde773214b841cef21d1549808. | 
> org.apache.flink.yarn.YarnResourceManager (ResourceManager.java:295) 
> 2020-06-29 21:05:42,004 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> Registered job manager 
> bff6deab98cf97541807aefbd57a4...@akka.ssl.tcp://flink@anawrk00002-3:32586/user/jobmanager_1
>  for job d42515dde773214b841cef21d1549808. | 
> org.apache.flink.yarn.YarnResourceManager (ResourceManager.java:676) 
> 2020-06-29 21:05:42,005 | INFO | [flink-akka.actor.default-dispatcher-494] | 
> JobManager successfully registered at ResourceManager, leader id: 
> 909783bdcc87da86ee9c7af38ed44f3a. | 
> org.apache.flink.runtime.jobmaster.JobMaster (JobMaster.java:1329) 
> 2020-06-29 21:05:42,005 | INFO | [flink-akka.actor.default-dispatcher-647] | 
> Requesting new slot [SlotRequestId\{4ea750e557da048edf908efa6ad86564}] and 
> profile ResourceProfile\{cpuCores=-1.0, heapMemoryInMB=-1, 
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource 
> manager. | org.apache.flink.runtime.jobmaster.slotpool.SlotPool 
> (SlotPool.java:709)
>  
> Then a network exception probably occurred again.
> 2020-06-29 21:05:42,615 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 3 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:42,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 3 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:42,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 3 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:42,631 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 3 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:43,616 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 4 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:43,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 4 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:43,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 4 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:43,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 4 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:44,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 5 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:44,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 5 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:44,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 5 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:44,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 5 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:45,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 6 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:45,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 6 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:45,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 6 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:45,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 6 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:46,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 7 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:46,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 7 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:46,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 7 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:46,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 7 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:47,617 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 8 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:47,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 8 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:47,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 8 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:47,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 8 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:48,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 9 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:48,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 9 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:48,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 9 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:48,632 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 9 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:49,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:49,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:49,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:49,633 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:50,618 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:50,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:50,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:50,633 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:51,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 12 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:51,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 12 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:51,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 12 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:51,633 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 12 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:52,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 13 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:52,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 13 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:52,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 13 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:52,633 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 13 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:53,619 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 14 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:53,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 14 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:53,620 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 14 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:53,634 | INFO | [Suspend state waiting handler] | Connection 
> to Zookeeper is SUSPENDED. Wait it to be back. Already waited 14 seconds. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2020-06-29 21:05:54,619 | ERROR | [Suspend state waiting handler] | We've 
> lost connection to Zookeeper. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:584) 
> 2020-06-29 21:05:54,620 | INFO | [Suspend state waiting handler] | 
> https://172.28.16.13:32261 lost leadership | 
> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint 
> (WebMonitorEndpoint.java:763) 
> 2020-06-29 21:05:54,620 | ERROR | [Suspend state waiting handler] | 
> *{color:#de350b}We've lost connection to Zookeeper.{color}* | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:584) 
> 2020-06-29 21:05:54,621 | ERROR | [Suspend state waiting handler] | We've 
> lost connection to Zookeeper. | 
> org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:584)
>  
> {color:#172b4d}*Finally, and critically, the process hung here: no further log 
> output or exceptions appeared until we found and killed it.*{color}
> 2020-06-29 21:06:10,709 | INFO | [flink-akka.actor.default-dispatcher-574] | 
> ResourceManager akka.ssl.tcp://flink@anawrk00002-3:32586/user/resourcemanager 
> was revoked leadership. Clearing fencing token. | 
> org.apache.flink.yarn.YarnResourceManager (ResourceManager.java:979) 
> 2020-06-29 21:06:10,709 | INFO | [flink-akka.actor.default-dispatcher-574] | 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/d42515dde773214b841cef21d1549808/job_manager_lock. | 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
> (ZooKeeperLeaderRetrievalService.java:117) 
> 2020-06-29 21:06:10,710 | INFO | [flink-akka.actor.default-dispatcher-574] | 
> *{color:#de350b}Suspending the SlotManager. | 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager 
> (SlotManager.java:225){color}*
>  
> {color:#de350b}*Is there an expert familiar with this situation?*{color}
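
For anyone retracing the reporter's filtering step on a similar log, the
following is a minimal sketch in Python. It assumes the raw, unwrapped
JobManager log with one record per line, in the pipe-separated layout quoted
above (timestamp | level | [thread] | message | logger). The names STATE_RE,
connection_state_changes, collapse and SAMPLE are illustrative only and are
not part of Flink or Curator; the sample records are copied from the excerpts
above with the Jira colour markup removed.

```python
import re
from datetime import datetime

# Assumed record layout, taken from the quoted excerpts:
# "<timestamp> | <LEVEL> | [<thread>] | <message> | <logger> (<file>:<line>)"
STATE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| \w+\s*\| "
    r"\[Curator-ConnectionStateManager-\d+\] \| new state: (?P<state>[A-Z_]+)"
)


def connection_state_changes(lines):
    """Yield (timestamp, state) for every connection state record."""
    for line in lines:
        match = STATE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, match.group("state")


def collapse(changes):
    """Drop consecutive repeats so each transition is reported once."""
    previous = None
    for ts, state in changes:
        if state != previous:
            yield ts, state
            previous = state


# Records copied from the issue description (unwrapped, markup stripped).
SAMPLE = [
    "2020-06-29 21:05:39,609 | INFO | [Curator-ConnectionStateManager-0] | "
    "new state: SUSPENDED | "
    "org.apache.flink.runtime.leaderelection.SmarterLeaderLatch "
    "(SmarterLeaderLatch.java:530)",
    "2020-06-29 21:05:39,611 | INFO | [Curator-ConnectionStateManager-0] | "
    "new state: SUSPENDED | "
    "org.apache.flink.runtime.leaderelection.SmarterLeaderLatch "
    "(SmarterLeaderLatch.java:530)",
    "2020-06-29 21:05:39,710 | INFO | [Curator-ConnectionStateManager-0] | "
    "new state: LOST | "
    "org.apache.flink.runtime.leaderelection.SmarterLeaderLatch "
    "(SmarterLeaderLatch.java:530)",
    "2020-06-29 21:05:39,933 | INFO | [Curator-ConnectionStateManager-0] | "
    "new state: RECONNECTED | "
    "org.apache.flink.runtime.leaderelection.SmarterLeaderLatch "
    "(SmarterLeaderLatch.java:530)",
    "2020-06-29 21:05:39,954 | INFO | [flink-akka.actor.default-dispatcher-647] | "
    "Starting the SlotManager. | "
    "org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager "
    "(SlotManager.java:198)",
]

if __name__ == "__main__":
    for ts, state in collapse(connection_state_changes(SAMPLE)):
        print(ts.isoformat(sep=" ", timespec="milliseconds"), state)
```

Collapsing consecutive repeats is a convenience for reading: the description
shows the same state logged several times within a few milliseconds, and the
collapsed view keeps only the first record of each transition.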



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
