Hi,

I’m trying to understand the guidelines for task manager recovery. From what I see in the documentation, the state backend can be set to in-memory, file system, or RocksDB, and checkpoint storage requires a shared file system for both the file-system and RocksDB backends. Is that correct? Must the file system be shared between the task managers and the job manager? Is there another option?
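To make the question concrete, here is a minimal flink-conf.yaml sketch of what I have in mind (the backend choice and the path are illustrative, assuming some shared mount or object store):

    # state backend: hashmap (in-memory) or rocksdb
    state.backend: rocksdb
    # checkpoint storage; illustrative path, assuming a location that is
    # reachable from both the job manager and the task managers
    state.checkpoints.dir: file:///shared/flink/checkpoints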
Thanks,
Ifat

From: Zhilong Hong <zhlongh...@gmail.com>
Date: Thursday, 24 February 2022 at 19:58
To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Flink job recovery after task manager failure

Hi, Afek,

I've read the log you provided. Since you've set restart-strategy to exponential-delay and restart-strategy.exponential-delay.initial-backoff to 10s, every time a failover is triggered the JobManager has to wait 10 seconds before it restarts the job. If you'd prefer a quicker restart, you could change the restart strategy to fixed-delay and set a small value for restart-strategy.fixed-delay.delay.
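For example, something along these lines in flink-conf.yaml (a sketch; the attempt count and delay are illustrative values, not recommendations):

    restart-strategy: fixed-delay
    # number of restart attempts before the job is declared failed
    restart-strategy.fixed-delay.attempts: 3
    # a small fixed delay between attempts, as suggested above
    restart-strategy.fixed-delay.delay: 1 s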
Furthermore, there were two more failovers during the initialization of the recovered tasks. During initialization, a task tries to recover its state from the last valid checkpoint, and a FileNotFound exception occurred during that recovery. I'm not quite sure of the reason. Since the recovery succeeded after two failovers, it may be that the local disks of your cluster are not stable.

Sincerely,
Zhilong

On Thu, Feb 24, 2022 at 9:04 PM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:

Thanks Zhilong.

The first launch of our job is fast, so I don't think that's the issue. I see in the Flink job manager log that there were several exceptions during the restart, and the task manager was restarted a few times until it stabilized. You can find the log here: jobmanager-log.txt.gz <https://nokia-my.sharepoint.com/:u:/p/ifat_afek/EUsu4rb_-BpNrkpvSwzI-vgBtBO9OQlIm0CHtW0gsZ7Gqg?email=zhlonghong%40gmail.com&e=ww5Idt>

Thanks,
Ifat

From: Zhilong Hong <zhlongh...@gmail.com>
Date: Wednesday, 23 February 2022 at 19:38
To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Flink job recovery after task manager failure

Hi, Afek!

When a TaskManager is killed, the JobManager is not notified until a heartbeat timeout happens. Currently, the default value of heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to trigger a failover. If you'd like to shorten the time before a failover is triggered in this situation, you could decrease the value of heartbeat.timeout in flink-conf.yaml. However, if the value is set too small, heartbeat timeouts will happen more frequently and the cluster will become unstable. As FLINK-23403 [2] mentions, if you are using Flink 1.14 or 1.15, you could try setting the value to 10s.

You mentioned that it takes 5-6 minutes to restart the jobs, which seems a bit odd. How long does it take to deploy your job for a brand-new launch? You could compress the JobManager log, upload it to Google Drive or OneDrive, and attach the sharing link. Maybe we can find out what happened from the log.

Sincerely,
Zhilong

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
[2] https://issues.apache.org/jira/browse/FLINK-23403

On Thu, Feb 24, 2022 at 12:25 AM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:

Hi,

I am trying to use Flink's checkpointing in order to support task manager recovery. I'm running Flink using Beam, with filesystem storage and the following parameters: checkpointingInterval=30000, checkpointingMode=EXACTLY_ONCE.

What I see is that if I kill a task manager pod, it takes Flink about 30 seconds to identify the failure and another 5-6 minutes to restart the jobs. Is there a way to shorten the downtime? What is the expected downtime after a task manager is killed, until the jobs are recovered? Are there any best practices for handling it (e.g. different configuration parameters)?

Thanks,
Ifat
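P.S. For completeness, the launch flags look roughly like this (a sketch; --runner is the standard Beam runner flag, and the two checkpointing flags carry the values mentioned above):

    --runner=FlinkRunner \
    --checkpointingInterval=30000 \
    --checkpointingMode=EXACTLY_ONCE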