Hi all! Over the past years of running Flink in production we have seen a large number of scenarios where Flink jobs go into unrecoverable failure loops and only a complete manual restart helps.
In most cases this is caused by memory leaks in the user program, leaking threads, and similar issues, and it turns into a failure loop because the job is restarted within the same JVM (TaskManager). With every restart the leak gets worse, eventually crashing the TMs one after another and never recovering. These issues are extremely hard to debug (they might only cause problems after a few failures) and can lead to long-lasting instabilities.

I suggest we add an option that triggers a complete restart after every so many failures. This would release all containers (TM and JM) and restart everything from scratch (a rough config sketch is at the bottom of this mail).

The only argument against this that I see is that it might further hide the root cause of the problem on the job/user side. While this is true, a stuck production job with crashing TMs is probably the worse of the two.

What do you think?

Gyula
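
P.S. To make the proposal a bit more concrete, here is a rough sketch of how such an option could be exposed in flink-conf.yaml. The key names below are made up purely for illustration, none of them exist today:

    # Hypothetical keys, for illustration only.
    # After this many job failures, release all TM/JM containers and
    # re-acquire fresh ones instead of restarting within the same JVMs.
    cluster.full-restart.failure-threshold: 10
    # Optionally reset the counter if the job runs cleanly for this long.
    cluster.full-restart.failure-window: 1 h

The exact naming and semantics (per-job vs. per-cluster counter, interaction with the existing restart strategies) would of course need to be discussed.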