Hi Gyula,

Personally I do not see a problem with providing such an option of a "clean restart" after N failures, especially if we set the default value for N to +infinity. However, people working more closely with Flink's scheduling systems might have more to say about this.
Piotrek

> On 29 Dec 2018, at 13:36, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> Hi all!
>
> In the past years while running Flink in production we have seen a huge
> number of scenarios where Flink jobs can go into unrecoverable failure
> loops and only a complete manual restart helps.
>
> In most cases this is due to memory leaks in the user program, leaking
> threads, etc., and it leads to a failure loop because the job is
> restarted within the same JVM (TaskManager). After the restart the leak
> gets worse and worse, eventually crashing the TMs one after the other and
> never recovering.
>
> These issues are extremely hard to debug (they might only cause problems
> after a few failures) and can cause long-lasting instabilities.
>
> I suggest we add an option that would trigger a complete restart after
> every so many failures. This would release all containers (TM and JM) and
> restart everything.
>
> The only argument against this that I see is that it might further hide
> the root cause of the problem on the job/user side. While this is true, a
> stuck production job with crashing TMs is probably much worse of the two.
>
> What do you think?
>
> Gyula
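For illustration, the proposed option might be expressed as a configuration fragment along the following lines. Note this is only a sketch of the idea: the key name `cluster.full-restart-after-failures` is invented for this example and is not an existing Flink configuration option.

```yaml
# Hypothetical flink-conf.yaml fragment sketching the proposal.
# 'cluster.full-restart-after-failures' is an invented key, not a real
# Flink option. A default of +infinity would preserve current behaviour
# (never force a full restart).

# After this many job failures, release all TM and JM containers and
# restart the whole cluster in fresh JVMs, clearing any leaked memory
# or threads accumulated in the old TaskManager processes.
cluster.full-restart-after-failures: 10
```

The key point of the design is that restarting in fresh JVMs, rather than reusing the same TaskManager processes, is what actually clears the leaked state.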