Hi team!
I came across strange behavior in Flink 1.17.1. If the S3 storage becomes
unavailable while a checkpoint is being built, the current checkpoint
expires by timeout and no new checkpoints are triggered.
Triggering of new checkpoints resumes only after S3 is restored, which
can be a long time later.
I can reproduce it: wait for a checkpoint to start, then disconnect the S3 storage.
2023-10-27 09:48:11,866 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 2504 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698400091851 for job
00000000000000000000000000000000.
2023-10-27 09:58:12,873 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint
2504 of job 00000000000000000000000000000000 expired before completing.
2023-10-27 09:58:12,874 WARN
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
trigger or complete checkpoint 2504 for job 00000000000000000000000000000000.
(0 consecutive failed attempts so far)
After the current checkpoint expires (our timeout is 10 min), there are no new
triggering attempts in the logs until the S3 storage is restored:
2023-10-27 10:42:09,530 WARN
org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle [] - Could not
properly discard misc file states.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
2023-10-27 10:42:13,305 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 2505 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698400691875 for job
00000000000000000000000000000000.
2023-10-27 10:42:39,287 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 2505 for job 00000000000000000000000000000000 (10023840497 bytes,
checkpointDuration=2666106 ms, finalizationTime=1306 ms).
2023-10-27 10:44:39,288 INFO
org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint
request time in queue: 1887436
2023-10-27 10:44:39,300 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 2506 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698403479288 for job
00000000000000000000000000000000.
2023-10-27 10:44:50,924 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 2506 for job 00000000000000000000000000000000 (10085877149 bytes,
checkpointDuration=11011 ms, finalizationTime=625 ms).
2023-10-27 10:46:50,924 INFO
org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint
request time in queue: 1119073
TaskManager logs when the S3 storage is restored:
2023-10-27 10:42:13,302 DEBUG
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Cleanup
AsyncCheckpointRunnable for checkpoint 2504 of Process ...
2023-10-27 10:42:13,302 DEBUG
org.apache.flink.streaming.runtime.tasks.StreamTask [] - Notify
checkpoint 2503 complete on task ...
2023-10-27 10:42:13,302 DEBUG
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] -
Notification of checkpoint ABORT 2504 for task ...
It looks like everything hangs on requests for the state of objects in S3
storage (repeated HEAD requests with the full object path in S3 storage).
Sometimes we also observed that the job completely stops working (no consuming
and no producing) until the S3 storage is restored.
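As a rough cross-check of the logs above, the small Python sketch below converts
the "@ ..." trigger timestamps (epoch milliseconds) into wall-clock time and
compares them with the times the log lines were written. It assumes the log
timestamps are in UTC, which the numbers for checkpoint 2504 seem to confirm.
All variable and function names are made up for illustration. The result
suggests that checkpoint 2505 carries a trigger timestamp of 09:58:11.875, about
44 minutes before its "Triggering checkpoint" line was logged, so the trigger
appears to have been started around the expiry of 2504 and then blocked until
S3 came back, rather than not being attempted at all.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime without float rounding."""
    return EPOCH + timedelta(milliseconds=ms)


def from_log(ts: str) -> datetime:
    """Parse a log timestamp like '2023-10-27 10:42:13,305' (assumed to be UTC)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)


# Checkpoint 2504: trigger timestamp vs. the time the log line was written.
trigger_time_2504 = from_epoch_ms(1698400091851)
logged_trigger_2504 = from_log("2023-10-27 09:48:11,866")
trigger_delay_2504 = logged_trigger_2504 - trigger_time_2504

# Checkpoint 2505: same comparison, after the S3 outage.
trigger_time_2505 = from_epoch_ms(1698400691875)
logged_trigger_2505 = from_log("2023-10-27 10:42:13,305")
trigger_delay_2505 = logged_trigger_2505 - trigger_time_2505

# Trigger timestamp + checkpointDuration + finalizationTime (values from the log)
# compared with the time of the "Completed checkpoint 2505" line.
logged_complete_2505 = from_log("2023-10-27 10:42:39,287")
reconstructed_complete_2505 = trigger_time_2505 + timedelta(milliseconds=2666106 + 1306)

print("2504 trigger timestamp:", trigger_time_2504.isoformat())
print("2504 delay until logged:", trigger_delay_2504)
print("2505 trigger timestamp:", trigger_time_2505.isoformat())
print("2505 delay until logged:", trigger_delay_2505)
print("2505 completion matches log:", reconstructed_complete_2505 == logged_complete_2505)
```

For 2504 the delay between the trigger timestamp and the log line is a few
milliseconds; for 2505 it is roughly the length of the outage.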
Is this expected behavior?
P.S. If the storage failure occurs before checkpoint assembly starts, then
everything works as expected: new checkpoints are triggered at every configured
interval and expire after 10 min.