Hi Kien

From your description, your job had already started to execute checkpoints 
after the failover, which means it was back in RUNNING status. From my point of 
view, the actual recovery time is the time the job spends going through the 
status transition RESTARTING -> CREATED -> RUNNING [1].
Your trouble sounds more like the long time the first checkpoint needs to 
complete after the failover. AFAIK, this is probably because your job is heavily 
back-pressured after the failover and the checkpoint mode is exactly-once: each 
operator has to receive the checkpoint barriers from all of its inputs before it 
can take part in the checkpoint, so barrier alignment can take a long time under 
back pressure. You can watch the checkpoint alignment time metrics to verify the 
root cause, and if you do not need exactly-once guarantees, you can switch the 
checkpoint mode to at-least-once [2]; see the sketch below.
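
For example, a minimal sketch of the at-least-once switch (the class name, the 
60-second interval and the toy pipeline below are only placeholders, not your 
actual job):

  import org.apache.flink.streaming.api.CheckpointingMode;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class AtLeastOnceCheckpointSketch {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env =
                  StreamExecutionEnvironment.getExecutionEnvironment();

          // checkpoint every 60 seconds -- the interval is only an example value
          env.enableCheckpointing(60_000L);

          // at-least-once skips barrier alignment, so a checkpoint is not held up
          // by back-pressured inputs the way an exactly-once checkpoint can be
          env.getCheckpointConfig()
             .setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);

          // ... build your actual pipeline here; this toy source is a placeholder
          env.fromElements(1, 2, 3).print();

          env.execute("at-least-once checkpointing sketch");
      }
  }

Keep in mind that with at-least-once mode, records in flight when a checkpoint 
barrier passes may be replayed after a recovery, so downstream consumers need to 
tolerate duplicates.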

Best
Yun Tang


[1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
[2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once

________________________________
From: trung kien <kient...@gmail.com>
Sent: Thursday, September 6, 2018 18:50
To: user@flink.apache.org
Subject: Flink failure recovery took a very long time

Hi all,

I am trying to test failure recovery of a Flink job when a JM or TM goes down.
Our target is for the job to restart automatically and return to normal 
operation in any case.

However, what I am seeing is very strange, and I hope someone here can help me 
understand it.

When a JM or TM went down, I saw the job being restarted, but as soon as it 
restarted it started working on a checkpoint that usually took 30+ minutes to 
finish (in the normal case, a checkpoint only takes 1-2 minutes). As soon as the 
checkpoint finished, the job was back to normal.

I'm using 1.4.2, but I'm seeing the same thing on 1.6.0 as well.

Could anyone please help explain this behavior? We really want to reduce the 
recovery time, but we can't seem to find any documentation that describes the 
recovery process in detail.

Any help is really appreciated.


--
Thanks
Kien
