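You can configure the job so that a failed checkpoint does not affect it. In your case the failure seems to have also caused the job to restart?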
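To illustrate why a single expired checkpoint can restart the job, here is a minimal sketch of a "tolerable consecutive failures" counter. It is a simplified, hypothetical model written for this reply, not Flink's source code: the class and method names are made up, and I am assuming that the tolerance defaults to 0 in 1.12 and that a completed checkpoint resets the count. In the DataStream API the knob I have in mind is, as far as I know, `env.getCheckpointConfig().setTolerableCheckpointFailureNumber(n)`.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical model of a tolerable-checkpoint-failure counter. Not Flink source code. */
final class CheckpointFailureModel {
    private final int tolerableFailures;                   // 0 means the first failure fails the job
    private final Set<Long> countedIds = new HashSet<>();  // avoids counting one checkpoint twice
    private int continuousFailures = 0;

    CheckpointFailureModel(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failed (for example expired) checkpoint; returns true if the job would be failed. */
    boolean onCheckpointFailed(long checkpointId) {
        if (countedIds.add(checkpointId)) {
            continuousFailures++;
        }
        if (continuousFailures > tolerableFailures) {
            reset();
            return true;
        }
        return false;
    }

    /** A completed checkpoint ends the current run of consecutive failures. */
    void onCheckpointCompleted() {
        reset();
    }

    private void reset() {
        continuousFailures = 0;
        countedIds.clear();
    }

    public static void main(String[] args) {
        // Tolerance 0: the single expired checkpoint 661820 is enough to fail the job.
        System.out.println(new CheckpointFailureModel(0).onCheckpointFailed(661820L)); // prints "true"
        // Tolerance 3: the same expiry is tolerated.
        System.out.println(new CheckpointFailureModel(3).onCheckpointFailed(661820L)); // prints "false"
    }
}
```

Raising the tolerance only hides the symptom. The checkpoint that expires after the 20-minute timeout still needs to be investigated.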
Frost Wong <frostw...@hotmail.com> wrote on Thu, Mar 18, 2021 at 10:38 AM:

> Hi everyone,
>
> I am running a job in Flink on YARN mode, and every few hours it hits the following error:
>
> 2021-03-18 08:52:37,019 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).
> 2021-03-18 08:52:37,637 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
> 2021-03-18 08:52:42,956 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).
> 2021-03-18 08:52:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
> 2021-03-18 09:12:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
> 2021-03-18 09:12:43,615 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
>     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]
> 2021-03-18 09:12:43,618 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.
> 2021-03-18 09:12:43,619 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.
> 2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.
> 2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.
>
> After that the job recovered on its own. I am using Unaligned Checkpoint with the RocksDB state backend, and there are no other error messages before or after this one. Judging from the checkpoint metrics, there is always one last one left that cannot complete, and adjusting the parallelism did not solve the problem.
>
> Thanks, everyone!