elon_X created FLINK-35178:
------------------------------
Summary: Checkpoint CLAIM mode does not fully control snapshot ownership
Key: FLINK-35178
URL: https://issues.apache.org/jira/browse/FLINK-35178
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.18.0
Reporter: elon_X
Attachments: image-2024-04-20-14-51-21-062.png
When incremental checkpointing is enabled and the task fails or is canceled for
some reason, restarting it from {{-s checkpoint_path}} with {{restoreMode
CLAIM}} lets the Flink job recover from the last checkpoint, but Flink then
discards only that previous checkpoint.
Then I found that this leads to the following two cases:
1. If the new checkpoint_x meta file does not reference files in the shared
directory under the previous jobID:
the shared and taskowned directories of the previous job are left behind as
empty directories, and Flink never deletes them.
!image-2024-04-20-14-51-21-062.png!
2. If the new checkpoint_x meta file references files in the shared directory
under the previous jobID:
the chk-(x-1) from the previous job is discarded, but state data remains in the
shared directory under that job and might persist for a relatively long time.
This raises a question: the previous job is no longer running, so it is unclear
whether users may delete that state data. Deleting it could cause errors when
the task is restarted, because the meta file might reference files that can no
longer be found. This could be confusing for users.
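As a rough way to tell the two cases apart on disk, here is a minimal Python sketch. It assumes the checkpoint storage is reachable as a local or mounted path and that the previous job's directory contains the {{shared}} and {{taskowned}} subdirectories described above. The function name {{classify_leftovers}} is hypothetical and not part of Flink.

```python
from pathlib import Path


def classify_leftovers(job_dir: Path) -> dict:
    """Report what is left in shared/ and taskowned/ under one job's checkpoint directory."""
    report = {}
    for name in ("shared", "taskowned"):
        sub = job_dir / name
        if not sub.is_dir():
            report[name] = "missing"
            continue
        # Count regular files anywhere below the directory.
        file_count = sum(1 for p in sub.rglob("*") if p.is_file())
        report[name] = "empty" if file_count == 0 else f"{file_count} file(s)"
    return report
```

A result of "empty" for both directories corresponds to case 1. Any remaining files in {{shared}} correspond to case 2, and the script cannot tell whether a newer checkpoint still references them, so it only reports and never deletes.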
A potential solution might be to reuse the previous job's jobID when
restoring from {{-s checkpoint_path}}, or to add a new parameter that
allows users to specify the jobID they want to recover from.
Please correct me if there's anything I've misunderstood.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)