Hi Deepak,

it seems that the JobManager's deathwatch declares the TaskManager to be
unreachable which will automatically quarantine it. You're right that in
such a case, the TaskManager should shut down and be restarted so that it
can again reconnect to the JobManager. This is, however, not yet supported
automatically.

For the moment, I'd recommend you to make the deathwatch a little bit less
aggressive via the following settings:

- akka.watch.heartbeat.interval: Heartbeat interval for Akka’s DeathWatch
mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked
dead because of lost or delayed heartbeat messages, then you should
increase this value. A thorough description of Akka’s DeathWatch can be
found here (DEFAULT: akka.ask.timeout/10).
- akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka’s
DeathWatch mechanism. A low value does not allow a irregular heartbeat. A
thorough description of Akka’s DeathWatch can be found here (DEFAULT:
akka.ask.timeout).
- akka.watch.threshold: Threshold for the DeathWatch failure detector. A
low value is prone to false positives whereas a high value increases the
time to detect a dead TaskManager. A thorough description of Akka’s
DeathWatch can be found here (DEFAULT: 12).

I hope this helps you to work around the problem for the moment until we've
added the automatic shut down and restart.

Cheers,
Till

On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha <dkjhan...@gmail.com> wrote:

> Hi Till,
> One more thing i noticed after looking into following message in
> taskmanager log
>
> 2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
> [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate
> with
> unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is
> now gated for 5000 ms, all messages to this address will be delivered to
> dead letters. Reason: *The remote system has quarantined this system. No
> further associations to the remote system are possible until this system is
> restarted*.
>
> So in this case ideally the local ActorSystem should go down so that
> service supervisor/runit will restart the system and taskmanager will again
> be able to connect to the remote system.. If it does not happen
> automatically then we have to monitor logs in some way and then try to
> ensure that it restarts. Ideally flink taskmanager Actor System should go
> down. Please let me know if my understanding is wrong.
>
>
>
>
>
> On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <dkjhan...@gmail.com> wrote:
>
> > Hi Till,
> > I'm getting following message in Jobmanager log
> >
> > 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> *Detected
> > unreachable: [akka.tcp://flink@10.8.4.57:6121
> > <http://flink@10.8.4.57:6121>]*
> > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.
> JobManager
> > - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> > terminated.
> > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.
> InstanceManager
> > - Unregistered task manager akka.tcp://flink@10.8.4.57:
> > 6121/user/taskmanager. Number of registered task managers 2. Number of
> > available slots 4.
> > 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> > [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> > irrecoverably failed. *UID is now quarantined and all messages to this
> > UID will be delivered to dead letters. Remote actorsystem must be
> restarted
> > to recover from this situation.*
> > 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> > endpointManager/reliableEndpointWriter-akka.
> tcp%3A%2F%2Fflink%4010.8.4.57%
> > 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> > 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> > letters encountered. This logging can be turned off or adjusted with
> > configuration settings 'akka.log-dead-letters' and
> > 'akka.log-dead-letters-during-shutdown'.
> > 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/
> > akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> > 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> > letters encountered. This logging can be turned off or adjusted with
> > configuration settings 'akka.log-dead-letters' and
> > 'akka.log-dead-letters-during-shutdown'.
> > 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> > [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> > Recovering checkpoints from ZooKeeper.
> >
> > Hope it helps. I'm using Flink 1.0.2
> >
> > On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> >> Hi Deepak,
> >>
> >> could you check the logs whether the JobManager has been quarantined and
> >> thus, cannot be connected to anymore? The logs should at least contain a
> >> hint why the TaskManager lost the connection initially.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <dkjhan...@gmail.com> wrote:
> >>
> >> > Hi,
> >> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are
> >> on
> >> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology
> works
> >> > fine, but after few hours I do see that Taskmanagers gets detached
> with
> >> > Jobmanager. I tried to reach Jobmanager using telnet at the same time
> >> and
> >> > it worked but Taskmanager does not succeed in connecting again. It
> >> attaches
> >> > only after I restart it. I tried following settings but still the
> >> problem
> >> > persists.
> >> >
> >> > akka.ask.timeout: 20 s
> >> > akka.lookup.timeout: 20 s
> >> > akka.watch.heartbeat.interval: 20 s
> >> >
> >> > Please find attached snapshot on one of the Taskmanager. Is there any
> >> > setting that I need to do ?
> >> >
> >> > --
> >> > Thanks,
> >> > Deepak Jha
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Thanks,
> > Deepak Jha
> >
> >
>
>
> --
> Thanks,
> Deepak Jha
>

Reply via email to