[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
renxiang zhou updated FLINK-31249:
----------------------------------
    Summary: Checkpoint timeout mechanism fails when completePendingCheckpoint is stuck  (was: Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS)

> Checkpoint timeout mechanism fails when completePendingCheckpoint is stuck
> --------------------------------------------------------------------------
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png
>
>
> The {{jobmanager-future}} thread may block while writing metadata to DFS
> because of a DFS failure, and it holds the {{CheckpointCoordinator}} lock
> for as long as it is blocked.
> When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
> for that lock to be released. If the previous checkpoint times out in the
> meantime, the {{Checkpoint Timer}} thread cannot execute the timeout event
> because it is still waiting for the lock. As a result, the previous
> checkpoint cannot be cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
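
The lock contention described above can be reproduced in isolation. The following is a minimal sketch in plain Java, not Flink code: the class `CheckpointLockDemo`, the method `timeoutHandledWhileWriteStuck`, the `coordinatorLock` object, and the latch standing in for a hung DFS write are all hypothetical names chosen for illustration. Only the thread names are taken from the report. The `writeUnderLock = false` branch is a contrast case that shows what changes when the blocking write is not done under the lock; it is not claimed to be the fix applied in Flink.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointLockDemo {

    /**
     * Returns true if the timeout handler ran while the simulated
     * metadata write was still stuck, false if it was locked out.
     */
    public static boolean timeoutHandledWhileWriteStuck(boolean writeUnderLock)
            throws InterruptedException {
        final Object coordinatorLock = new Object();            // stand-in for the coordinator lock
        final CountDownLatch writeStarted = new CountDownLatch(1);
        final CountDownLatch dfsRecovers = new CountDownLatch(1); // stays closed while "DFS" hangs
        final CountDownLatch timeoutHandled = new CountDownLatch(1);

        // Simulates the thread that completes the pending checkpoint.
        Thread completer = new Thread(() -> {
            try {
                if (writeUnderLock) {
                    synchronized (coordinatorLock) {
                        writeStarted.countDown();
                        dfsRecovers.await();                     // blocked write, lock still held
                    }
                } else {
                    writeStarted.countDown();
                    dfsRecovers.await();                         // blocked write, lock not held
                    synchronized (coordinatorLock) {
                        // finalize the checkpoint under the lock (nothing to do in this sketch)
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "jobmanager-future");
        completer.setDaemon(true);
        completer.start();
        writeStarted.await();                                    // the write is now in progress

        // Simulates the timer thread that fires the checkpoint timeout.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "Checkpoint Timer");
            t.setDaemon(true);
            return t;
        });
        timer.schedule(() -> {
            synchronized (coordinatorLock) {                     // the handler needs the same lock
                timeoutHandled.countDown();
            }
        }, 50, TimeUnit.MILLISECONDS);

        // Give the handler ample time to run while the write is still stuck.
        boolean ran = timeoutHandled.await(500, TimeUnit.MILLISECONDS);

        // Clean up: let the "DFS" recover so both threads can finish.
        dfsRecovers.countDown();
        completer.join();
        timer.shutdown();
        return ran;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("write under lock:   timeout handled = "
                + timeoutHandledWhileWriteStuck(true));
        System.out.println("write outside lock: timeout handled = "
                + timeoutHandledWhileWriteStuck(false));
    }
}
```

With `writeUnderLock = true` the timer task fires but parks on the monitor, so the timeout is not processed until the write returns, which matches the behaviour reported in this issue. The timings (50 ms delay, 500 ms wait) are arbitrary values for the sketch.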