That’s a good point, Till. Blacklisting TMs could handle this. One scenario that might be problematic is when a clean restart is needed after a more or less random number of job resubmissions, for example if resource leakage progresses at different rates on different nodes. In such a situation, if we blacklist and restart TMs one by one, the job can keep failing constantly, each time because of a different TM, and it could end up in an endless loop in some scenarios/setups. Gyula’s proposal, on the other hand, would restart all of the TMs at once, resetting the leakage on all of them at the same time and making a successful restart possible.
I still think that blacklisting TMs is a better way to do it, but maybe we still need some kind of limit, like after N blacklists restart all TMs. But this would also add additional complexity.

Piotrek

> On 3 Jan 2019, at 13:59, Till Rohrmann <[email protected]> wrote:
>
> Hi Gyula,
>
> I see the benefit of having such an option. In fact, it goes in a similar
> direction as the currently ongoing discussion about blacklisting TMs. In
> the end it could work by reporting failures to the RM, which aggregates some
> statistics for the individual TMs. Based on some thresholds it could then
> decide to free/blacklist a specific TM. Whether to blacklist or restart a
> container could then be a configurable option.
>
> Cheers,
> Till
>
> On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <[email protected]> wrote:
>
>> Hi Gyula,
>>
>> Personally I do not see a problem with providing such an option of a “clean
>> restart” after N failures, especially if we set the default value for N to
>> +infinity. However, guys working more with Flink’s scheduling systems might
>> have more to say about this.
>>
>> Piotrek
>>
>>> On 29 Dec 2018, at 13:36, Gyula Fóra <[email protected]> wrote:
>>>
>>> Hi all!
>>>
>>> In the past years while running Flink in production we have seen a huge
>>> number of scenarios where the Flink jobs can go into unrecoverable failure
>>> loops and only a complete manual restart helps.
>>>
>>> This is in most cases due to memory leaks in the user program, leaking
>>> threads etc., and it leads to a failure loop because the job is restarted
>>> within the same JVM (TaskManager). After the restart the leak gets worse
>>> and worse, eventually crashing the TMs one after the other and never
>>> recovering.
>>>
>>> These issues are extremely hard to debug (they might only cause problems
>>> after a few failures) and can cause long lasting instabilities.
>>>
>>> I suggest we enable an option that would trigger a complete restart every
>>> so many failures. This would release all containers (TM and JM) and
>>> restart everything.
>>>
>>> The only argument against this I see is that this might further hide the
>>> root cause of the problem on the job/user side. While this is true, a stuck
>>> production job with crashing TMs is probably much worse out of these two.
>>>
>>> What do you think?
>>>
>>> Gyula
>>
>>
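To make the "after N blacklists, restart everything" limit a bit more concrete, here is a minimal sketch in plain Java of how such an escalation policy could look: count failures per TM, blacklist a TM once it crosses a threshold, and fall back to a full restart once too many TMs have been blacklisted. This is not tied to any existing Flink API; all class, method, and threshold names are made up for illustration.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical escalation policy, sketched only to illustrate the idea
    // discussed in this thread; not part of Flink.
    public class EscalatingFailurePolicy {

        private final int failuresPerTmThreshold; // blacklist a TM after this many failures
        private final int blacklistLimit;         // full restart after this many blacklisted TMs

        private final Map<String, Integer> failuresPerTm = new HashMap<>();
        private final Set<String> blacklistedTms = new HashSet<>();

        public EscalatingFailurePolicy(int failuresPerTmThreshold, int blacklistLimit) {
            this.failuresPerTmThreshold = failuresPerTmThreshold;
            this.blacklistLimit = blacklistLimit;
        }

        /** Possible reactions to a reported failure. */
        public enum Action { RESTART_TASKS, BLACKLIST_TM, FULL_RESTART }

        /** Called whenever a failure is attributed to a TaskManager. */
        public Action onFailure(String taskManagerId) {
            int failures = failuresPerTm.merge(taskManagerId, 1, Integer::sum);

            if (failures < failuresPerTmThreshold) {
                // Below the per-TM threshold: just restart the failed tasks.
                return Action.RESTART_TASKS;
            }

            blacklistedTms.add(taskManagerId);
            if (blacklistedTms.size() >= blacklistLimit) {
                // Too many TMs blacklisted: the leakage is likely cluster-wide,
                // so release all containers and start from scratch.
                failuresPerTm.clear();
                blacklistedTms.clear();
                return Action.FULL_RESTART;
            }

            // Only this TM looks unhealthy: blacklist it and request a fresh container.
            return Action.BLACKLIST_TM;
        }
    }

Where such logic would actually live (e.g. in the RM, which aggregates the failure statistics as Till suggested) and how the thresholds would be exposed as configuration would of course still need to be part of the design discussion.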
