Could it also work so that after a certain number of tries it blacklists everything? That way it would effectively trigger a fresh restart.
Gyula

On Fri, 4 Jan 2019 at 10:11, Piotr Nowojski <pi...@da-platform.com> wrote:

> That’s a good point Till. Blacklisting TMs could be able to handle this.
> One scenario that might be problematic is if a clean restart is needed
> after a more or less random number of job resubmissions, for example if
> resource leakage has different rates on different nodes. In such a
> situation, if we blacklist and restart TMs one by one, the job can keep
> failing constantly, with the failures caused every time by a different TM.
> It could end up in an endless loop in some scenarios/setups, whereas
> Gyula’s proposal would restart all of the TMs at once, resetting the
> leakage on all of the TMs at the same time and making a successful
> restart possible.
>
> I still think that blacklisting TMs is the better way to do it, but maybe
> we still need some kind of limit, like restarting all TMs after N
> blacklists. But this would also add complexity.
>
> Piotrek
>
> > On 3 Jan 2019, at 13:59, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Hi Gyula,
> >
> > I see the benefit of having such an option. In fact, it goes in a
> > similar direction as the currently ongoing discussion about
> > blacklisting TMs. In the end it could work by reporting failures to the
> > RM, which aggregates some statistics for the individual TMs. Based on
> > some thresholds it could then decide to free/blacklist a specific TM.
> > Whether to blacklist or restart a container could then be a
> > configurable option.
> >
> > Cheers,
> > Till
> >
> > On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <pi...@da-platform.com> wrote:
> >
> >> Hi Gyula,
> >>
> >> Personally I do not see a problem with providing such an option of a
> >> “clean restart” after N failures, especially if we set the default
> >> value for N to +infinity. However, the people working more closely
> >> with Flink’s scheduling might have more to say about this.
> >>
> >> Piotrek
> >>
> >>> On 29 Dec 2018, at 13:36, Gyula Fóra <gyula.f...@gmail.com> wrote:
> >>>
> >>> Hi all!
> >>>
> >>> In the past years of running Flink in production, we have seen a huge
> >>> number of scenarios in which Flink jobs go into unrecoverable failure
> >>> loops where only a complete manual restart helps.
> >>>
> >>> This is in most cases caused by memory leaks in the user program,
> >>> leaking threads, etc., and it leads to a failure loop because the job
> >>> is restarted within the same JVM (TaskManager). After each restart
> >>> the leak gets worse and worse, eventually crashing the TMs one after
> >>> the other and never recovering.
> >>>
> >>> These issues are extremely hard to debug (they might only cause
> >>> problems after a few failures) and can cause long-lasting
> >>> instabilities.
> >>>
> >>> I suggest we add an option that triggers a complete restart after
> >>> every so many failures. This would release all containers (TM and JM)
> >>> and restart everything.
> >>>
> >>> The only argument against this that I see is that it might further
> >>> hide the root cause of the problem on the job/user side. While that
> >>> is true, of the two, a stuck production job with crashing TMs is
> >>> probably much worse.
> >>>
> >>> What do you think?
> >>>
> >>> Gyula
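
To make the escalation idea concrete, here is a minimal sketch of the kind
of bookkeeping the RM could do. All names are hypothetical (this is not the
actual Flink API); it only illustrates the combined proposal of blacklisting
per TM, then escalating to a full restart after N blacklists:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Hypothetical sketch: per-TM failure counts with escalation to a full restart. */
    public class BlacklistTracker {

        private final int failuresPerTmThreshold; // blacklist a TM after this many failures
        private final int maxBlacklistedTms;      // after this many blacklists, restart everything

        private final Map<String, Integer> failuresPerTm = new HashMap<>();
        private final Set<String> blacklisted = new HashSet<>();

        public BlacklistTracker(int failuresPerTmThreshold, int maxBlacklistedTms) {
            this.failuresPerTmThreshold = failuresPerTmThreshold;
            this.maxBlacklistedTms = maxBlacklistedTms;
        }

        /** Called for every task failure attributed to a TM. */
        public Action onFailure(String taskManagerId) {
            int failures = failuresPerTm.merge(taskManagerId, 1, Integer::sum);
            if (failures < failuresPerTmThreshold || blacklisted.contains(taskManagerId)) {
                return Action.NONE;
            }
            blacklisted.add(taskManagerId);
            if (blacklisted.size() >= maxBlacklistedTms) {
                // Escalate: release all containers (TM and JM) and start fresh,
                // resetting any accumulated leakage on every TM at once.
                failuresPerTm.clear();
                blacklisted.clear();
                return Action.FULL_RESTART;
            }
            // Default path: only the offending TM is released/blacklisted.
            return Action.BLACKLIST_TM;
        }

        public enum Action { NONE, BLACKLIST_TM, FULL_RESTART }
    }

Setting maxBlacklistedTms to Integer.MAX_VALUE degrades this to plain
blacklisting, which matches the suggestion of keeping the escalation opt-in
with a default of +infinity.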