Hi Gyula,

I see the benefit of having such an option. In fact, it goes in a similar
direction to the ongoing discussion about blacklisting TMs. In the end, it
could work by reporting failures to the RM, which aggregates some statistics
for the individual TMs. Based on some thresholds, it could then decide to
free/blacklist a specific TM. Whether to blacklist or restart a container
could then be a configurable option.
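
To make the idea concrete, here is a minimal sketch of the aggregation, assuming
a simple per-TM failure counter with a single threshold. The class
TmFailureTracker, the Action enum, and reportFailure are hypothetical names
for illustration only; none of them are part of Flink's API, and a real
implementation would likely also consider time windows and failure types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these names are illustrative and not part of Flink. */
public class TmFailureTracker {

    /** Assumed configurable reaction once a TM crosses the threshold. */
    public enum Action { NONE, BLACKLIST, RESTART_CONTAINER }

    private final int failureThreshold;
    private final Action actionOnThreshold;
    // Failure count per TM id, as aggregated on the RM side.
    private final Map<String, Integer> failuresPerTm = new HashMap<>();

    public TmFailureTracker(int failureThreshold, Action actionOnThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
        this.actionOnThreshold = actionOnThreshold;
    }

    /** Records one failure for the given TM and returns the action to take. */
    public Action reportFailure(String tmId) {
        int count = failuresPerTm.merge(tmId, 1, Integer::sum);
        if (count >= failureThreshold) {
            // Reset the counter: the TM is about to be freed or blacklisted.
            failuresPerTm.remove(tmId);
            return actionOnThreshold;
        }
        return Action.NONE;
    }
}
```

With a threshold of Integer.MAX_VALUE the tracker would effectively never
trigger, which could serve as a "disabled by default" setting.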

Cheers,
Till

On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <pi...@da-platform.com> wrote:

> Hi Gyula,
>
> Personally, I do not see a problem with providing such an option of “clean
> restart” after N failures, especially if we set the default value for N to
> +infinity. However, those working more closely with Flink’s scheduling
> systems might have more to say about this.
>
> Piotrek
>
> > On 29 Dec 2018, at 13:36, Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > Hi all!
> >
> > Over the past years of running Flink in production, we have seen a huge
> > number of scenarios where Flink jobs go into unrecoverable failure
> > loops and only a complete manual restart helps.
> >
> > This is in most cases due to memory leaks in the user program, leaking
> > threads, etc., and it leads to a failure loop because the job is
> > restarted within the same JVM (TaskManager). After each restart the leak
> > gets worse, eventually crashing the TMs one after the other, and the job
> > never recovers.
> >
> > These issues are extremely hard to debug (they might only cause problems
> > after a few failures) and can cause long-lasting instabilities.
> >
> > I suggest we enable an option that would trigger a complete restart every
> > so many failures. This would release all containers (TM and JM) and
> > restart everything.
> >
> > The only argument against this I see is that this might further hide the
> > root cause of the problem on the job/user side. While this is true, a
> > stuck production job with crashing TMs is probably the worse of the two.
> >
> > What do you think?
> >
> > Gyula
>
>
