Re: Flink HA on AWS: Network related issue

Deepak Jha Sun, 11 Sep 2016 20:55:40 -0700

Hi Till,
I noticed one more thing after looking at the following message in the
TaskManager log:


2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
[flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate with
unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: *The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted*.

So in this case the local ActorSystem should ideally go down, so that the
service supervisor/runit restarts it and the TaskManager is able to connect
to the remote system again. If that does not happen automatically, we have
to monitor the logs in some way and trigger the restart ourselves. Please
let me know if my understanding is wrong.
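
If the ActorSystem does not shut down on its own, the log-monitoring
fallback could be sketched roughly as below (Python). This is only an
illustration under assumptions: the match string is copied from the warning
above and may be worded differently in other Akka versions, the service name
"flink-taskmanager" is hypothetical, and "sv restart" assumes the
TaskManager runs under runit.

```python
import re
import subprocess

# Text copied from the Akka warning quoted above. Assumption: other Akka
# versions may word this message differently.
QUARANTINE_RE = re.compile(r"The remote system has quarantined this system")


def is_quarantined(log_text):
    """Return True if the log text contains the quarantine warning.

    Log output may be wrapped across lines or decorated with '*', so
    asterisks are dropped and whitespace is collapsed before matching.
    """
    normalised = " ".join(log_text.replace("*", " ").split())
    return QUARANTINE_RE.search(normalised) is not None


def scan_from(log_path, offset=0):
    """Scan the log from a byte offset; return (found, new_offset).

    Remembering the offset avoids reacting to the same old warning again
    after the TaskManager has already been restarted.
    """
    with open(log_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
        new_offset = f.tell()
    return is_quarantined(chunk.decode("utf-8", errors="replace")), new_offset


def restart_command(service="flink-taskmanager"):
    """Build the runit restart command (the service name is hypothetical)."""
    return ["sv", "restart", service]


def check_once(log_path, offset=0, service="flink-taskmanager"):
    """Restart the service if a new quarantine warning appeared."""
    found, new_offset = scan_from(log_path, offset)
    if found:
        subprocess.run(restart_command(service), check=True)
    return new_offset
```

A cron job or a small loop could call check_once() periodically and carry
the returned offset into the next call. It does not handle log rotation.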





On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <dkjhan...@gmail.com> wrote:

> Hi Till,
> I'm getting the following messages in the JobManager log:
>
> 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> *Detected unreachable: [akka.tcp://flink@10.8.4.57:6121]*
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.JobManager
> - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> terminated.
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
> - Unregistered task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager.
> Number of registered task managers 2. Number of available slots 4.
> 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> irrecoverably failed. *UID is now quarantined and all messages to this
> UID will be delivered to dead letters. Remote actorsystem must be restarted
> to recover from this situation.*
> 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%
> 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/
> akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> Recovering checkpoints from ZooKeeper.
>
> Hope it helps. I'm using Flink 1.0.2.
>
> On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Deepak,
>>
>> could you check in the logs whether the JobManager has been quarantined
>> and thus cannot be connected to anymore? The logs should at least contain
>> a hint as to why the TaskManager lost the connection initially.
>>
>> Cheers,
>> Till
>>
>> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <dkjhan...@gmail.com> wrote:
>>
>> > Hi,
>> > I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each on
>> > an EC2 m4.large instance, with checkpointing enabled on S3). My topology
>> > works fine, but after a few hours I see that the TaskManagers get
>> > detached from the JobManager. I tried to reach the JobManager using
>> > telnet at the same time and it worked, but the TaskManager does not
>> > succeed in connecting again. It attaches only after I restart it. I
>> > tried the following settings, but the problem still persists.
>> >
>> > akka.ask.timeout: 20 s
>> > akka.lookup.timeout: 20 s
>> > akka.watch.heartbeat.interval: 20 s
>> >
>> > Please find attached a snapshot from one of the TaskManagers. Is there
>> > any setting that I need to change?
>> >
>> > --
>> > Thanks,
>> > Deepak Jha
>> >
>> >
>>
>
>
>
> --
> Thanks,
> Deepak Jha
>
>


-- 
Thanks,
Deepak Jha
