Re: Job Recovery Time on TM Lost

Gen Luo Tue, 29 Jun 2021 19:52:23 -0700

Hi Lu,

We found almost the same thing when we were trying failover in a large
scale job. The akka.ask.timeout and heartbeat.timeout were set to 10min for
the test, and we found that the job would take 10min to recover from TM
lost.


We reached the conclusion that the behavior is expected in the Flink
version we used(1.13), and it should also apply to previous versions.
Here's how we think things are happening.

Flink JM uses heartbeat to check TM status. When a TM is lost, an upstream
or downstream TM will first sense it, while JM is not aware until the
heartbeat of the lost TM is timeout. The upstream/downstream TM then sends
an exception to JM, and JM will start canceling the job by sending cancel
requests to all TMs. However, the lost TM can't respond to the request, and
JM has to wait until the cancel request(via akka) is timeout before it can
mark the TM failed and continue the failover procedure.

We suppose that the phase from CANCELING to CANCELED takes
min(akka.ask.timeout, heartbeat.timeout), though not confirmed yet.

Hope it helps. Please let me know if there's anything wrong.

Lu Niu <qqib...@gmail.com> 于2021年6月30日周三 上午8:45写道：

> Hi, Flink Users
>
> We(Pinterest) are trying to speed up recovery speed when flink jobs hit
> one-time exceptions. To understand the baseline, the first test we do is to
> randomly kill one TM container and watch for how fast the flink job can
> recover. We did such test to multiple jobs and here are some findings:
>
>    - The whole recovery stage can break down into two phases:
>       - Phase 1: job state switched from RUNNING -> RESTARTING ->
>       RUNNING. All tasks switch from RUNNING -> CANCELING -> CANCELED.
>       - Phase 2: All tasks switch from CREATED -> SCHEDULED -> DEPLOYING
>       -> RUNNING.
>    - Phase 1 always takes around 30s and Phase 2 takes around 10 - 15s.
>
> Question:
> Why does Phase 1 always take about 30s? I shared related logs about 2 jobs
> showing that. Does it have sth to do with akka config?
>
> Our setup:
> flink version 1.11
> running yarn per-job mode
> akka.ask.timeout: 30 s
> akka.lookup.timeout: 30 s
> akka.tcp.timeout: 30 s
>
> Best
> Lu
>
>

Re: Job Recovery Time on TM Lost

Reply via email to