Hi Yun,

Thanks a lot for your pointers! I’ll try it out as you suggested and then get 
back to you.

Best,
Paul Lam

> 2021年11月23日 16:32,Yun Tang <myas...@live.com> 写道:
> 
> Hi Paul,
> 
> This is really weird, from what I know, flink-1.11.0 has a problem of 
> handling min-pause time [1] and this should be resolved in flink-1.12.1.
> 
> Could you open the debug log level for org.apache.flink.runtime.checkpoint 
> and use jmap or byteman to get the field value of 
> CheckpointCoordinator#lastCheckpointCompletionRelativeTime, 
> CheckpointRequestDecider#minPauseBetweenCheckpoints and 
> SystemClock#relativeTimeMillis in method 
> CheckpointRequestDecider#nextTriggerDelayMillis [2] to see any unexpected 
> behavior.
> 
> 
> [1] https://issues.apache.org/jira/browse/FLINK-18856 
> <https://issues.apache.org/jira/browse/FLINK-18856>
> [2] 
> https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182
>  
> <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182>
> 
> 
> Best
> Yun Tang
> 
> From: Paul Lam <paullin3...@gmail.com>
> Sent: Tuesday, November 23, 2021 14:35
> To: user <user@flink.apache.org>
> Subject: Inaccurate checkpoint trigger time
>  
> Hi,
> 
> Recently I’ve noticed a job has nondeterministic checkpoint trigger time. 
> 
> The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 parallelism. 
> It was configured to trigger checkpoint every 150 seconds with 0 pause time 
> and no concurrent checkpoints. However there’re obvious errors in the 
> checkpoint trigger times, as the actual interval may vary from 30 seconds to 
> 6 minutes.
> 
> The jobmanager logs are good, and no error logs is found. Some of the output 
> are as follow: 
> 
> 
> 2021-11-23 13:51:46,438 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes 
> in 22166 ms).
> 
> 2021-11-23 13:57:21,021 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 13:57:43,761 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes 
> in 21813 ms).
> 
> 2021-11-23 13:59:09,387 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 13:59:31,370 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes 
> in 20354 ms).
> 
> 2021-11-23 14:06:37,916 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:07:03,157 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes 
> in 23779 ms).
> 
> 2021-11-23 14:07:05,838 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:07:30,748 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes 
> in 22998 ms).
> 
> 2021-11-23 14:13:09,089 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:13:38,411 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes 
> in 27616 ms).
> 
> 2021-11-23 14:13:38,676 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:14:01,937 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes 
> in 21869 ms).
> 
> 2021-11-23 14:20:04,923 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:20:26,592 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes 
> in 20172 ms).
> 
> 2021-11-23 14:21:54,879 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job 
> f432b8d90859db54f7a79ff29a563ee4.
> 
> 2021-11-23 14:22:19,392 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
> checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes 
> in 22930 ms).
> 
> 
> It looks pretty weird to me. Please help me narrow down the problem if you 
> have any idea.
> 
> Best,
> Paul Lam

Reply via email to