Hi trung,

Can you provide more information to help locate the issue? For example, the size of the state generated by a checkpoint, plus more log output; you can try switching the log level to DEBUG.

Thanks,
vino
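For reference, raising Flink's log level to DEBUG is usually a change to conf/log4j.properties on each JobManager and TaskManager machine, followed by a restart of those processes. A minimal sketch, assuming the default log4j 1.x configuration shipped with Flink 1.4/1.6 (the appender name "file" is taken from that default and may differ in your setup):

    # conf/log4j.properties on every JM/TM node
    # the shipped default is: log4j.rootLogger=INFO, file
    log4j.rootLogger=DEBUG, file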
Yun Tang <myas...@live.com> wrote on Thursday, September 6, 2018 at 7:42 PM:

> Hi Kien,
>
> From your description, your job had already started to execute checkpoints
> after the job failover, which means your job was in RUNNING status. From my
> point of view, the actual recovery time should be the time the job spends in
> the statuses RESTARTING -> CREATED -> RUNNING [1].
>
> Your trouble sounds more like the long time needed for the first checkpoint
> to complete after the job failover. AFAIK, it's probably because your job is
> heavily back-pressured after the failover and the checkpoint mode is
> exactly-once, so some operators need to receive all input checkpoint barriers
> before they can trigger the checkpoint. You can watch your checkpoint
> alignment time metrics to verify the root cause, and if you do not need the
> exactly-once guarantee, you can change the checkpoint mode to
> at-least-once [2].
>
> Best,
> Yun Tang
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
>
> ------------------------------
> *From:* trung kien <kient...@gmail.com>
> *Sent:* Thursday, September 6, 2018 18:50
> *To:* user@flink.apache.org
> *Subject:* Flink failure recovery took a very long time
>
> Hi all,
>
> I am trying to test failure recovery of a Flink job when a JM or TM goes
> down. Our goal is to have the job restart automatically and return to normal
> operation in any case.
>
> However, what I am seeing is very strange, and I hope someone here can help
> me understand it.
>
> When a JM or TM goes down, I see the job being restarted, but as soon as it
> restarts it starts working on a checkpoint that usually takes 30+ minutes to
> finish (in the normal case a checkpoint only takes 1-2 minutes). As soon as
> the checkpoint finishes, the job is back to normal.
>
> I'm using 1.4.2, but I see similar behavior on 1.6.0 as well.
>
> Could anyone please help explain this behavior? We really want to reduce the
> recovery time, but we can't seem to find any documentation that describes
> the recovery process in detail.
>
> Any help is really appreciated.
>
> --
> Thanks
> Kien
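If at-least-once semantics are acceptable for the job, the change Yun Tang suggests is made where checkpointing is enabled on the StreamExecutionEnvironment. A minimal Java sketch; the 2-minute interval, the class name, and the placeholder pipeline are illustrative assumptions, not taken from the thread:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AtLeastOnceCheckpointing {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 2 minutes (placeholder interval).
            // AT_LEAST_ONCE skips barrier alignment, so back-pressured operators
            // do not wait for all input barriers before taking their snapshot.
            env.enableCheckpointing(120_000L, CheckpointingMode.AT_LEAST_ONCE);

            // Placeholder pipeline so the sketch runs; replace with the real job.
            env.fromElements(1, 2, 3).print();

            env.execute("at-least-once checkpointing example");
        }
    }

The same setting can also be applied after enabling checkpointing via env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE).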