Hi,

Recently I’ve noticed that one of our jobs triggers checkpoints at irregular, 
seemingly nondeterministic intervals.

The job uses Flink 1.12.1 with FsStateBackend and has a parallelism of 650. 
It is configured to trigger a checkpoint every 150 seconds, with a minimum 
pause of 0 and no concurrent checkpoints. However, the actual trigger interval 
deviates noticeably from that setting, varying from about 30 seconds to more 
than 6 minutes.

The JobManager logs look normal and contain no errors. Some of the output 
follows:

2021-11-23 13:51:46,438 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes in 
22166 ms).
2021-11-23 13:57:21,021 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 13:57:43,761 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes in 
21813 ms).
2021-11-23 13:59:09,387 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 13:59:31,370 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes in 
20354 ms).
2021-11-23 14:06:37,916 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:07:03,157 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes in 
23779 ms).
2021-11-23 14:07:05,838 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:07:30,748 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes in 
22998 ms).
2021-11-23 14:13:09,089 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:13:38,411 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes in 
27616 ms).
2021-11-23 14:13:38,676 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:14:01,937 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes in 
21869 ms).
2021-11-23 14:20:04,923 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:20:26,592 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes in 
20172 ms).
2021-11-23 14:21:54,879 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job 
f432b8d90859db54f7a79ff29a563ee4.
2021-11-23 14:22:19,392 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes in 
22930 ms).
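
To quantify the drift, the gaps between consecutive trigger timestamps (the 
epoch-millisecond value after "@") can be computed with a small script. This is 
only a rough sketch: the regular expression and the helper name 
trigger_intervals are my own assumptions and not part of Flink, and SAMPLE just 
repeats four of the trigger lines above.

```python
import re

# Matches "Triggering checkpoint <id> (type=...) @ <epoch millis>".
# \s+ is used between tokens so that mail-client line wrapping is tolerated.
TRIGGER_RE = re.compile(
    r"Triggering\s+checkpoint\s+(\d+)\s+\(type=\w+\)\s+@\s+(\d+)"
)


def trigger_intervals(log_text):
    """Return (checkpoint_id, millis since the previous trigger) pairs."""
    triggers = [(int(cid), int(ts)) for cid, ts in TRIGGER_RE.findall(log_text)]
    return [
        (cur_id, cur_ts - prev_ts)
        for (_, prev_ts), (cur_id, cur_ts) in zip(triggers, triggers[1:])
    ]


# Four trigger lines copied from the log excerpt (wrapped as in the mail).
SAMPLE = """
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job
"""

for checkpoint_id, gap_ms in trigger_intervals(SAMPLE):
    print(f"checkpoint {checkpoint_id}: {gap_ms / 1000:.1f} s since previous trigger")
```

For these four triggers the gaps come out at roughly 108.5 s, 448.5 s and 
27.9 s, against the configured 150 s.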

This looks odd to me. Any ideas that would help narrow down the problem would 
be much appreciated.

Best,
Paul Lam
