Hi Till,

I'm getting the following messages in the JobManager log:

2016-09-09 07:46:55,093 PDT [WARN] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected unreachable: [akka.tcp://[email protected]:6121]*
2016-09-09 07:46:55,094 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://[email protected]:6121/user/taskmanager terminated.
2016-09-09 07:46:55,094 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager - Unregistered task manager akka.tcp://[email protected]:6121/user/taskmanager. Number of registered task managers 2. Number of available slots 4.
2016-09-09 07:46:55,096 PDT [WARN] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-982] Remoting - Association to [akka.tcp://[email protected]:6121] having UID [-1223410403] is irrecoverably failed. *UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.*
2016-09-09 07:46:55,097 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef - Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://flink/deadLetters] to Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:55,098 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef - Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:58,479 PDT [INFO] ip-10-8-11-249 [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.

Hope it helps. I'm using Flink 1.0.2.

On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[email protected]> wrote:

> Hi Deepak,
>
> could you check the logs whether the JobManager has been quarantined and
> thus, cannot be connected to anymore? The logs should at least contain a
> hint why the TaskManager lost the connection initially.
>
> Cheers,
> Till
>
> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[email protected]> wrote:
>
> > Hi,
> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
> > fine, but after few hours I do see that Taskmanagers gets detached with
> > Jobmanager. I tried to reach Jobmanager using telnet at the same time and
> > it worked but Taskmanager does not succeed in connecting again. It attaches
> > only after I restart it. I tried following settings but still the problem
> > persists.
> >
> > akka.ask.timeout: 20 s
> > akka.lookup.timeout: 20 s
> > akka.watch.heartbeat.interval: 20 s
> >
> > Please find attached snapshot on one of the Taskmanager. Is there any
> > setting that I need to do ?
> >
> > --
> > Thanks,
> > Deepak Jha

--
Thanks,
Deepak Jha
