Could it also work so that after a certain number of tries it blacklists everything? That way it would effectively trigger a fresh restart.
Gyula

On Fri, 4 Jan 2019 at 10:11, Piotr Nowojski <pi...@da-platform.com> wrote:

> That’s a good point Till. Blacklisting TMs could be able to handle this.
> One scenario that might be problematic is if a clean restart is needed
> after a more or less random number of job resubmissions, for example if
> resource leakage has different rates on different nodes. In such a
> situation, if we blacklist and restart TMs one by one, the job can keep
> failing constantly, with the failures caused every time by a different TM.
> It could end up in an endless loop in some scenarios/setups, whereas
> Gyula’s proposal would restart all of the TMs at once, resetting the
> leakage on all of the TMs at the same time and making a successful
> restart possible.
>
> I still think that blacklisting TMs is the better way to do it, but maybe
> we still need some kind of limit, like restarting all TMs after N
> blacklists. But this would also add complexity.
>
> Piotrek
>
> > On 3 Jan 2019, at 13:59, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Hi Gyula,
> >
> > I see the benefit of having such an option. In fact, it goes in a
> > similar direction as the currently ongoing discussion about
> > blacklisting TMs. In the end it could work by reporting failures to the
> > RM, which aggregates some statistics for the individual TMs. Based on
> > some thresholds it could then decide to free/blacklist a specific TM.
> > Whether to blacklist or restart a container could then be a
> > configurable option.
> >
> > Cheers,
> > Till
> >
> > On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <pi...@da-platform.com> wrote:
> >
> >> Hi Gyula,
> >>
> >> Personally I do not see a problem with providing such an option of a
> >> “clean restart” after N failures, especially if we set the default
> >> value for N to +infinity. However, the people working more closely
> >> with Flink’s scheduling might have more to say about this.
> >>
> >> Piotrek
> >>
> >>> On 29 Dec 2018, at 13:36, Gyula Fóra <gyula.f...@gmail.com> wrote:
> >>>
> >>> Hi all!
> >>>
> >>> In the past years of running Flink in production, we have seen a huge
> >>> number of scenarios in which Flink jobs go into unrecoverable failure
> >>> loops where only a complete manual restart helps.
> >>>
> >>> This is in most cases caused by memory leaks in the user program,
> >>> leaking threads, etc., and it leads to a failure loop because the job
> >>> is restarted within the same JVM (TaskManager). After each restart
> >>> the leak gets worse and worse, eventually crashing the TMs one after
> >>> the other and never recovering.
> >>>
> >>> These issues are extremely hard to debug (they might only cause
> >>> problems after a few failures) and can cause long-lasting
> >>> instabilities.
> >>>
> >>> I suggest we add an option that triggers a complete restart after
> >>> every so many failures. This would release all containers (TM and JM)
> >>> and restart everything.
> >>>
> >>> The only argument against this that I see is that it might further
> >>> hide the root cause of the problem on the job/user side. While that
> >>> is true, of the two, a stuck production job with crashing TMs is
> >>> probably much worse.
> >>>
> >>> What do you think?
> >>>
> >>> Gyula
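
To make the escalation idea concrete, here is a minimal sketch of the kind
of bookkeeping the RM could do. All names are hypothetical (this is not the
actual Flink API); it only illustrates the combined proposal of blacklisting
per TM, then escalating to a full restart after N blacklists:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Hypothetical sketch: per-TM failure counts with escalation to a full restart. */
    public class BlacklistTracker {

        private final int failuresPerTmThreshold; // blacklist a TM after this many failures
        private final int maxBlacklistedTms;      // after this many blacklists, restart everything

        private final Map<String, Integer> failuresPerTm = new HashMap<>();
        private final Set<String> blacklisted = new HashSet<>();

        public BlacklistTracker(int failuresPerTmThreshold, int maxBlacklistedTms) {
            this.failuresPerTmThreshold = failuresPerTmThreshold;
            this.maxBlacklistedTms = maxBlacklistedTms;
        }

        /** Called for every task failure attributed to a TM. */
        public Action onFailure(String taskManagerId) {
            int failures = failuresPerTm.merge(taskManagerId, 1, Integer::sum);
            if (failures < failuresPerTmThreshold || blacklisted.contains(taskManagerId)) {
                return Action.NONE;
            }
            blacklisted.add(taskManagerId);
            if (blacklisted.size() >= maxBlacklistedTms) {
                // Escalate: release all containers (TM and JM) and start fresh,
                // resetting any accumulated leakage on every TM at once.
                failuresPerTm.clear();
                blacklisted.clear();
                return Action.FULL_RESTART;
            }
            // Default path: only the offending TM is released/blacklisted.
            return Action.BLACKLIST_TM;
        }

        public enum Action { NONE, BLACKLIST_TM, FULL_RESTART }
    }

Setting maxBlacklistedTms to Integer.MAX_VALUE degrades this to plain
blacklisting, which matches the suggestion of keeping the escalation opt-in
with a default of +infinity.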