renxiang zhou created FLINK-31249:
-------------------------------------
Summary: Checkpoint Timer failed to process timeout events when it
blocked at writing _metadata to DFS
Key: FLINK-31249
URL: https://issues.apache.org/jira/browse/FLINK-31249
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.16.0, 1.11.6
Reporter: renxiang zhou
Fix For: 1.18.0
Attachments: image-2023-02-28-11-25-03-637.png
The jobmanager-future thread can block while writing _metadata to DFS when the
DFS fails, and it holds the CheckpointCoordinator lock while it is blocked.
When the next checkpoint is triggered, the Checkpoint Timer thread waits for
the lock to be released. If the previous checkpoint then times out, the
Checkpoint Timer cannot run the timeout event because it is still blocked
waiting for the lock. As a result, the previous checkpoint cannot be cancelled.
!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
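
The contention described above can be reproduced in isolation. The following is a minimal sketch, not Flink code: the class TimerLockSketch, its method, and the plain Object used as a lock are hypothetical stand-ins, and Thread.sleep simulates the stalled DFS write. It only shows that a timeout task scheduled on a single timer thread cannot run while another thread holds the shared lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names exist in Flink.
final class TimerLockSketch {

    /**
     * Simulates a slow metadata write under a shared lock and a timeout task
     * that needs the same lock. Returns true only if the timeout task managed
     * to run while the write was still in progress.
     */
    static boolean timeoutRanDuringWrite(long writeMillis, long timeoutMillis) {
        final Object coordinatorLock = new Object();   // stand-in for the coordinator lock
        final AtomicBoolean writing = new AtomicBoolean(false);
        final AtomicBoolean ranDuringWrite = new AtomicBoolean(false);
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch timeoutDone = new CountDownLatch(1);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Plays the role of the jobmanager-future thread: blocks on "I/O"
        // while holding the lock.
        Thread writer = new Thread(() -> {
            synchronized (coordinatorLock) {
                writing.set(true);
                lockHeld.countDown();
                try {
                    Thread.sleep(writeMillis);         // simulated stalled DFS write
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                writing.set(false);
            }
        }, "jobmanager-future-sketch");

        try {
            writer.start();
            lockHeld.await();                          // the writer now owns the lock

            // Plays the role of the checkpoint timer: the timeout task fires on
            // time but must wait for the lock before it can do anything.
            timer.schedule(() -> {
                synchronized (coordinatorLock) {
                    ranDuringWrite.set(writing.get());
                    timeoutDone.countDown();
                }
            }, timeoutMillis, TimeUnit.MILLISECONDS);

            timeoutDone.await();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sketch interrupted", e);
        } finally {
            timer.shutdownNow();
        }
        return ranDuringWrite.get();
    }

    public static void main(String[] args) {
        // The timeout is due after 50 ms, but the write holds the lock for 300 ms.
        System.out.println("timeout ran during write: " + timeoutRanDuringWrite(300, 50));
    }
}
```

In this sketch the timeout task becomes due long before the write finishes, yet it only executes after the writer leaves the synchronized block, which mirrors the behaviour reported here. One possible direction, stated as an assumption rather than the adopted fix, is to perform the blocking write outside the lock so the timer thread is never starved.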