Hi Gyula,

Personally, I do not see a problem with providing such a “clean restart” 
option after N failures, especially if we set the default value of N to 
+infinity. However, people working more closely with Flink’s scheduling might 
have more to say about this.
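
For reference, here is a minimal sketch of how a job opts into the existing
per-job restart strategy today (assuming the 1.x DataStream API). Those
retries all happen inside the already-running TaskManager JVMs, which is
exactly why a leak in user code survives every attempt; the cluster-level
clean restart after N failures discussed in this thread would be a new,
separate setting that does not exist yet.

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Retry the job up to 3 times, waiting 10 seconds between attempts.
        // The attempts reuse the same TaskManager processes, so memory or
        // threads leaked by previous attempts are still around.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3, Time.of(10, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-example");
    }
}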

Piotrek

> On 29 Dec 2018, at 13:36, Gyula Fóra <gyula.f...@gmail.com> wrote:
> 
> Hi all!
> 
> Over the past years of running Flink in production we have seen a huge
> number of scenarios where Flink jobs go into unrecoverable failure
> loops and only a complete manual restart helps.
> 
> In most cases this is caused by memory leaks in the user program, leaked
> threads, etc., and it turns into a failure loop because the job is
> restarted within the same JVM (TaskManager). After each restart the leak
> gets worse and worse, eventually crashing the TMs one after the other and
> never recovering.
> 
> These issues are extremely hard to debug (they might only cause problems
> after a few failures) and can cause long-lasting instability.
> 
> I suggest we add an option that triggers a complete restart after every N
> failures. This would release all containers (TM and JM) and restart
> everything from scratch.
> 
> The only argument against this that I see is that it might further hide
> the root cause of the problem on the job/user side. While this is true, a
> stuck production job with crashing TMs is probably the worse of the two.
> 
> What do you think?
> 
> Gyula
