We found out there's a taskmanager.exit-on-fatal-akka-error property that
will restart Flink in this situation, but it is not enabled by default and
it feels like a rather blunt tool. I would expect systems like this to be
more resilient to this kind of disruption.
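
For reference, a minimal sketch of how that property could be switched on,
assuming the standard conf/flink-conf.yaml file on each task manager. As far
as I understand it, the option only makes the task manager process exit once
it has been quarantined, so something external (for example a container
restart policy, which is an assumption about the deployment) would still
have to bring it back up:

```yaml
# conf/flink-conf.yaml on each task manager (assumed location)
# Terminate the task manager on a fatal Akka error such as quarantining.
# Off by default.
taskmanager.exit-on-fatal-akka-error: true
```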

On 23 February 2018 at 14:42, Aljoscha Krettek <aljos...@apache.org> wrote:

> @Till Is this the expected behaviour or do you suspect something could be
> going wrong?
>
>
> On 23. Feb 2018, at 08:59, jelmer <jkupe...@gmail.com> wrote:
>
> We've observed on our Flink 1.4.0 setup that if, for some reason, the
> networking between the task manager and the job manager gets disrupted, then
> the task manager is never able to reconnect.
>
> You'll end up with messages like this getting printed to the log repeatedly:
>
> Trying to register at JobManager 
> akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout: 30000 
> milliseconds)
> Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable 
> or has not been restarted. Keeping it quarantined.
>
>
> Or alternatively:
>
>
> Tried to associate with unreachable remote address 
> [akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all 
> messages to this address will be delivered to dead letters. Reason: [The 
> remote system has quarantined this system. No further associations to the 
> remote system are possible until this system is restarted.]
>
>
> But it never recovers until you restart either the job manager or the task
> manager.
>
> I was able to reproduce this behaviour with two Docker containers here:
>
> https://github.com/jelmerk/flink-worker-not-rejoining
>
> Has anyone else seen this problem?
>
