Hi trung,

Can you provide more information to help locate the issue? For example, the size of the state generated by a checkpoint, plus more log output; you can try switching the log level to DEBUG.

Thanks,
vino
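For reference, raising Flink's log level to DEBUG is usually a change to conf/log4j.properties on each JobManager and TaskManager machine, followed by a restart of those processes. A minimal sketch, assuming the default log4j 1.x configuration shipped with Flink 1.4/1.6 (the appender name "file" is taken from that default and may differ in your setup):

    # conf/log4j.properties on every JM/TM node
    # the shipped default is: log4j.rootLogger=INFO, file
    log4j.rootLogger=DEBUG, file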
Yun Tang <myas...@live.com> wrote on Thursday, September 6, 2018 at 7:42 PM:

> Hi Kien,
>
> From your description, your job had already started to execute checkpoints
> after the job failover, which means your job was in RUNNING status. From my
> point of view, the actual recovery time should be the time the job spends in
> the statuses RESTARTING -> CREATED -> RUNNING [1].
>
> Your trouble sounds more like the long time needed for the first checkpoint
> to complete after the job failover. AFAIK, it's probably because your job is
> heavily back-pressured after the failover and the checkpoint mode is
> exactly-once, so some operators need to receive all input checkpoint barriers
> before they can trigger the checkpoint. You can watch your checkpoint
> alignment time metrics to verify the root cause, and if you do not need the
> exactly-once guarantee, you can change the checkpoint mode to
> at-least-once [2].
>
> Best,
> Yun Tang
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
>
> ------------------------------
> *From:* trung kien <kient...@gmail.com>
> *Sent:* Thursday, September 6, 2018 18:50
> *To:* user@flink.apache.org
> *Subject:* Flink failure recovery took a very long time
>
> Hi all,
>
> I am trying to test failure recovery of a Flink job when a JM or TM goes
> down. Our goal is to have the job restart automatically and return to normal
> operation in any case.
>
> However, what I am seeing is very strange, and I hope someone here can help
> me understand it.
>
> When a JM or TM goes down, I see the job being restarted, but as soon as it
> restarts it starts working on a checkpoint that usually takes 30+ minutes to
> finish (in the normal case a checkpoint only takes 1-2 minutes). As soon as
> the checkpoint finishes, the job is back to normal.
>
> I'm using 1.4.2, but I see similar behavior on 1.6.0 as well.
>
> Could anyone please help explain this behavior? We really want to reduce the
> recovery time, but we can't seem to find any documentation that describes
> the recovery process in detail.
>
> Any help is really appreciated.
>
> --
> Thanks
> Kien
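If at-least-once semantics are acceptable for the job, the change Yun Tang suggests is made where checkpointing is enabled on the StreamExecutionEnvironment. A minimal Java sketch; the 2-minute interval, the class name, and the placeholder pipeline are illustrative assumptions, not taken from the thread:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AtLeastOnceCheckpointing {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 2 minutes (placeholder interval).
            // AT_LEAST_ONCE skips barrier alignment, so back-pressured operators
            // do not wait for all input barriers before taking their snapshot.
            env.enableCheckpointing(120_000L, CheckpointingMode.AT_LEAST_ONCE);

            // Placeholder pipeline so the sketch runs; replace with the real job.
            env.fromElements(1, 2, 3).print();

            env.execute("at-least-once checkpointing example");
        }
    }

The same setting can also be applied after enabling checkpointing via env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE).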