Hi all! Over the past years of running Flink in production we have seen a large number of scenarios where Flink jobs go into unrecoverable failure loops and only a complete manual restart helps.
In most cases this is caused by memory leaks in the user program, leaking threads, and similar issues, and it turns into a failure loop because the job is restarted within the same JVM (TaskManager). With every restart the leak gets worse, eventually crashing the TMs one after another and never recovering. These issues are extremely hard to debug (they might only cause problems after a few failures) and can lead to long-lasting instabilities.

I suggest we add an option that triggers a complete restart after every so many failures. This would release all containers (TM and JM) and restart everything from scratch (a rough config sketch is at the bottom of this mail).

The only argument against this that I see is that it might further hide the root cause of the problem on the job/user side. While this is true, a stuck production job with crashing TMs is probably the worse of the two.

What do you think?

Gyula
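
P.S. To make the proposal a bit more concrete, here is a rough sketch of how such an option could be exposed in flink-conf.yaml. The key names below are made up purely for illustration, none of them exist today:

    # Hypothetical keys, for illustration only.
    # After this many job failures, release all TM/JM containers and
    # re-acquire fresh ones instead of restarting within the same JVMs.
    cluster.full-restart.failure-threshold: 10
    # Optionally reset the counter if the job runs cleanly for this long.
    cluster.full-restart.failure-window: 1 h

The exact naming and semantics (per-job vs. per-cluster counter, interaction with the existing restart strategies) would of course need to be discussed.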