That’s a good point, Till. Blacklisting TMs could handle this. One scenario that might be problematic is when a clean restart is needed after a more or less random number of job resubmissions, for example if resource leakage progresses at different rates on different nodes. In such a situation, if we blacklist and restart TMs one by one, the job can keep failing constantly, each time because of a different TM, and it could end up in an endless loop in some scenarios/setups. Gyula’s proposal, on the other hand, would restart all of the TMs at once, resetting the leakage on all of them at the same time and making a successful restart possible.
I still think that blacklisting TMs is a better way to do it, but maybe we still need some kind of limit, like after N blacklists restart all TMs. But this would also add additional complexity.

Piotrek

> On 3 Jan 2019, at 13:59, Till Rohrmann <[email protected]> wrote:
>
> Hi Gyula,
>
> I see the benefit of having such an option. In fact, it goes in a similar
> direction as the currently ongoing discussion about blacklisting TMs. In
> the end it could work by reporting failures to the RM, which aggregates some
> statistics for the individual TMs. Based on some thresholds it could then
> decide to free/blacklist a specific TM. Whether to blacklist or restart a
> container could then be a configurable option.
>
> Cheers,
> Till
>
> On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <[email protected]> wrote:
>
>> Hi Gyula,
>>
>> Personally I do not see a problem with providing such an option of a “clean
>> restart” after N failures, especially if we set the default value for N to
>> +infinity. However, guys working more with Flink’s scheduling systems might
>> have more to say about this.
>>
>> Piotrek
>>
>>> On 29 Dec 2018, at 13:36, Gyula Fóra <[email protected]> wrote:
>>>
>>> Hi all!
>>>
>>> In the past years while running Flink in production we have seen a huge
>>> number of scenarios where the Flink jobs can go into unrecoverable failure
>>> loops and only a complete manual restart helps.
>>>
>>> This is in most cases due to memory leaks in the user program, leaking
>>> threads etc., and it leads to a failure loop because the job is restarted
>>> within the same JVM (TaskManager). After the restart the leak gets worse
>>> and worse, eventually crashing the TMs one after the other and never
>>> recovering.
>>>
>>> These issues are extremely hard to debug (they might only cause problems
>>> after a few failures) and can cause long lasting instabilities.
>>>
>>> I suggest we enable an option that would trigger a complete restart every
>>> so many failures. This would release all containers (TM and JM) and
>>> restart everything.
>>>
>>> The only argument against this I see is that this might further hide the
>>> root cause of the problem on the job/user side. While this is true, a stuck
>>> production job with crashing TMs is probably much worse out of these two.
>>>
>>> What do you think?
>>>
>>> Gyula
>>
>>
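To make the "after N blacklists, restart everything" limit a bit more concrete, here is a minimal sketch in plain Java of how such an escalation policy could look: count failures per TM, blacklist a TM once it crosses a threshold, and fall back to a full restart once too many TMs have been blacklisted. This is not tied to any existing Flink API; all class, method, and threshold names are made up for illustration.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical escalation policy, sketched only to illustrate the idea
    // discussed in this thread; not part of Flink.
    public class EscalatingFailurePolicy {

        private final int failuresPerTmThreshold; // blacklist a TM after this many failures
        private final int blacklistLimit;         // full restart after this many blacklisted TMs

        private final Map<String, Integer> failuresPerTm = new HashMap<>();
        private final Set<String> blacklistedTms = new HashSet<>();

        public EscalatingFailurePolicy(int failuresPerTmThreshold, int blacklistLimit) {
            this.failuresPerTmThreshold = failuresPerTmThreshold;
            this.blacklistLimit = blacklistLimit;
        }

        /** Possible reactions to a reported failure. */
        public enum Action { RESTART_TASKS, BLACKLIST_TM, FULL_RESTART }

        /** Called whenever a failure is attributed to a TaskManager. */
        public Action onFailure(String taskManagerId) {
            int failures = failuresPerTm.merge(taskManagerId, 1, Integer::sum);

            if (failures < failuresPerTmThreshold) {
                // Below the per-TM threshold: just restart the failed tasks.
                return Action.RESTART_TASKS;
            }

            blacklistedTms.add(taskManagerId);
            if (blacklistedTms.size() >= blacklistLimit) {
                // Too many TMs blacklisted: the leakage is likely cluster-wide,
                // so release all containers and start from scratch.
                failuresPerTm.clear();
                blacklistedTms.clear();
                return Action.FULL_RESTART;
            }

            // Only this TM looks unhealthy: blacklist it and request a fresh container.
            return Action.BLACKLIST_TM;
        }
    }

Where such logic would actually live (e.g. in the RM, which aggregates the failure statistics as Till suggested) and how the thresholds would be exposed as configuration would of course still need to be part of the design discussion.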
