Since you are deploying Flink workloads on YARN, the Flink ResourceManager should receive the container completion event via the heartbeat chain YARN NM -> YARN RM -> Flink RM, which takes 8 seconds by default. The Flink ResourceManager will then release the dead TaskManager container once it receives the completion event. As a result, Flink will not deploy tasks onto the dead TaskManagers.
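For reference, the timeouts discussed in this thread are ordinary Flink configuration options. A minimal `flink-conf.yaml` fragment (the values shown are the long-standing defaults; treat this as an illustrative sketch, not a tuning recommendation):

```yaml
# Heartbeat-based failure detection between the JobManager and TaskManagers.
# A dead TaskManager is only detected after heartbeat.timeout has elapsed.
heartbeat.interval: 10000   # ms between heartbeat requests (default)
heartbeat.timeout: 50000    # ms without a response before a TaskManager is marked dead (default)

# Timeout for Akka "ask" RPCs, e.g. the cancel call sent to a TaskManager.
akka.ask.timeout: 10 s
```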
I think most of the time cost in Phase 1 might be cancelling the tasks on the dead TaskManagers.

Best,
Yang

Till Rohrmann <trohrm...@apache.org> wrote on Thu, Jul 1, 2021, 4:49 PM:

> Gen's analysis is correct. Flink currently uses its heartbeat as the
> primary means of detecting dead TaskManagers. This means that Flink will
> take at least `heartbeat.timeout` before the system recovers. Even if the
> cancellation happens fast (e.g. by having configured a low
> `akka.ask.timeout`), Flink will still try to deploy tasks onto the dead
> TaskManager until it is marked as dead and its slots are released (unless
> the ResourceManager gets a signal from the underlying resource management
> system that a container/pod has died). One way to improve the situation
> would be to introduce logic which can react to a ConnectionException and
> then, for example, blacklist or release the TaskManager. This is
> currently not implemented in Flink, though.
>
> Concerning the cancellation operation: Flink currently does not listen to
> Akka's dead letters. This means that the `akka.ask.timeout` is the
> primary means of failing the future result of an RPC which could not be
> sent. This is also an improvement we should add to Flink's RpcService.
> I've created a JIRA issue for this problem [1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-23202
>
> Cheers,
> Till
>
> On Wed, Jun 30, 2021 at 6:33 PM Lu Niu <qqib...@gmail.com> wrote:
>
>> Thanks Gen! cc'ing flink-dev to collect more input.
>>
>> Best,
>> Lu
>>
>> On Wed, Jun 30, 2021 at 12:55 AM Gen Luo <luogen...@gmail.com> wrote:
>>
>>> I'm also wondering about this.
>>>
>>> In my opinion, it's because the JM cannot confirm whether the TM is
>>> lost or it's a temporary network problem that will recover soon: I can
>>> see in the log that Akka got a "Connection refused", but the JM still
>>> sends heartbeat requests to the lost TM until it reaches the heartbeat
>>> timeout. But I'm not sure whether it is indeed designed like this.
>>>
>>> I would really appreciate it if anyone who knows more details could
>>> answer. Thanks.
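The detection-latency point made above can be illustrated with a minimal, Flink-agnostic sketch (this is not Flink's actual implementation; the class and method names are hypothetical): a monitor that tracks the last heartbeat per TaskManager only declares a TaskManager dead once the full timeout has elapsed, even if a connection refusal was observed much earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of heartbeat-based failure detection. A TaskManager is
// declared dead only after `timeoutMillis` pass without a heartbeat,
// which is why recovery takes at least the heartbeat timeout.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat response from the given TaskManager.
    public void receiveHeartbeat(String taskManagerId, long nowMillis) {
        lastHeartbeat.put(taskManagerId, nowMillis);
    }

    // True only once the full timeout has elapsed since the last heartbeat;
    // an earlier ConnectionException would not shorten this on its own.
    public boolean isDead(String taskManagerId, long nowMillis) {
        Long last = lastHeartbeat.get(taskManagerId);
        return last != null && nowMillis - last > timeoutMillis;
    }
}
```

With a 50 s timeout, a TaskManager that stopped responding at t=0 is still considered alive at t=10 s and is only marked dead after t=50 s, matching the behaviour described in the thread.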