Hi,

Recently I’ve noticed that a job has nondeterministic checkpoint trigger times.
The job uses Flink 1.12.1 with FsStateBackend and runs at a parallelism of 650. It is configured to trigger a checkpoint every 150 seconds, with 0 pause time and no concurrent checkpoints. However, the checkpoint trigger times deviate noticeably from that schedule: the actual interval may vary from 30 seconds to 6 minutes. The JobManager logs look fine, and no error logs are found. Some of the output is as follows:

2021-11-23 13:51:46,438 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes in 22166 ms).
2021-11-23 13:57:21,021 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 13:57:43,761 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes in 21813 ms).
2021-11-23 13:59:09,387 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 13:59:31,370 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes in 20354 ms).
2021-11-23 14:06:37,916 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:07:03,157 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes in 23779 ms).
2021-11-23 14:07:05,838 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:07:30,748 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes in 22998 ms).
2021-11-23 14:13:09,089 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:13:38,411 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes in 27616 ms).
2021-11-23 14:13:38,676 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:14:01,937 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes in 21869 ms).
2021-11-23 14:20:04,923 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:20:26,592 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes in 20172 ms).
2021-11-23 14:21:54,879 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:22:19,392 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes in 22930 ms).

It looks pretty weird to me. Please help me narrow down the problem if you have any ideas.

Best,
Paul Lam
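
P.S. In case it helps anyone reproduce the interval numbers, here is a minimal Python sketch that extracts the gaps between consecutive "Triggering checkpoint" entries. It assumes the number after `@` is the trigger timestamp in epoch milliseconds; the names `TRIGGER_RE`, `trigger_intervals` and `SAMPLE` are only illustrative and are not part of Flink. The sample holds three of the trigger lines from the log above.

```python
import re

# Matches the trigger lines and captures (checkpoint id, value after "@").
# Assumption: the value after "@" is the trigger time in epoch milliseconds.
TRIGGER_RE = re.compile(r"Triggering checkpoint (\d+) \(type=CHECKPOINT\) @ (\d+)")


def trigger_intervals(log_text):
    """Return (checkpoint_id, ms since the previous trigger) pairs."""
    triggers = [(int(cid), int(ts)) for cid, ts in TRIGGER_RE.findall(log_text)]
    return [
        (cur_id, cur_ts - prev_ts)
        for (_, prev_ts), (cur_id, cur_ts) in zip(triggers, triggers[1:])
    ]


# Three trigger lines copied from the JobManager log in this message.
SAMPLE = """\
2021-11-23 13:57:21,021 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 13:59:09,387 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:06:37,916 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job f432b8d90859db54f7a79ff29a563ee4.
"""

if __name__ == "__main__":
    for checkpoint_id, gap_ms in trigger_intervals(SAMPLE):
        print(f"checkpoint {checkpoint_id}: {gap_ms / 1000:.1f} s after the previous trigger")
```

Running the same function over the full JobManager log should give the complete list of gaps to compare against the configured 150 seconds.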