[ https://issues.apache.org/jira/browse/FLINK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839522#comment-17839522 ]
elon_X commented on FLINK-35178:
--------------------------------

[~lijinzhong]

Thank you for your response.

I have been using the default value (true) for the "state.checkpoints.create-subdir" parameter. When I tested with this value set to false, the result was the same, which might indicate that I'm doing something wrong.

Additionally, I've encountered another issue. Even though I set {{state.checkpoints.num-retained=3}}, the previous job's older checkpoints are not discarded, even when nothing references them. Only the checkpoint specified by the {{-s}} option (chk-x) is discarded.

As shown in the screenshot below, I restored from chk-34, but only chk-34 was discarded, while chk-32 and chk-33 continue to exist indefinitely.

!image-2024-04-22-15-16-02-381.png!

> Checkpoint CLAIM mode does not fully control snapshot ownership
> ---------------------------------------------------------------
>
>                 Key: FLINK-35178
>                 URL: https://issues.apache.org/jira/browse/FLINK-35178
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: elon_X
>            Priority: Major
>         Attachments: image-2024-04-20-14-51-21-062.png,
> image-2024-04-22-15-16-02-381.png
>
>
> When I enable incremental checkpointing and the task fails or is canceled
> for some reason, restarting the task from {{-s checkpoint_path}} with
> {{restoreMode CLAIM}} allows the Flink job to recover from the last
> checkpoint, but Flink only discards that previous checkpoint.
> I found that this leads to the following two cases:
> 1. If the new checkpoint_x meta file does not reference files in the shared
> directory under the previous jobID:
> the shared and taskowned directories from the previous job are left as
> empty directories, and these two directories persist without being
> deleted by Flink.
> !image-2024-04-20-14-51-21-062.png!
> 2. If the new checkpoint_x meta file references files in the shared directory
> under the previous jobID:
> the chk-(x-1) from the previous job is discarded, but state data remains
> in the shared directory under that job, and it might persist for
> a relatively long time. This raises a question: the previous job is no
> longer running, and it is unclear whether users should delete the state data.
> Deleting it could lead to errors when the task is restarted, because the meta
> file might reference files that can no longer be found; this could be confusing
> for users.
>
> A potential solution might be to reuse the previous job's jobID when
> restoring from {{-s checkpoint_path}}, or to add a new parameter that
> allows users to specify the jobID they want to recover from.
>
> Please correct me if there's anything I've misunderstood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
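
The leftovers described in the comment and in the two cases of the quoted description can be inspected with a small script. The following is only a minimal sketch, not part of Flink: the helper names {{leftover_state_dirs}} and {{retained_checkpoints}} are hypothetical, and it assumes the checkpoint directory is on a local (or locally mounted) file system and laid out as <checkpoint_root>/<job_id>/ with shared, taskowned and chk-N subdirectories. It only reports what is present and deletes nothing, since it cannot tell whether a newer checkpoint still references files in shared.

```python
from pathlib import Path


def leftover_state_dirs(checkpoint_root, job_id):
    """Classify the shared/ and taskowned/ directories of a previous job.

    Returns a dict mapping each directory name to "missing", "empty",
    or "non-empty". Assumed layout (hypothetical helper, local paths only):
    <checkpoint_root>/<job_id>/{shared,taskowned,chk-N}
    """
    job_dir = Path(checkpoint_root) / job_id
    report = {}
    for name in ("shared", "taskowned"):
        directory = job_dir / name
        if not directory.is_dir():
            report[name] = "missing"
        elif any(directory.iterdir()):
            # Case 2: state files may still be referenced by a newer checkpoint.
            report[name] = "non-empty"
        else:
            # Case 1: an empty directory left behind.
            report[name] = "empty"
    return report


def retained_checkpoints(checkpoint_root, job_id):
    """List the chk-N directories still present, ordered by checkpoint number."""
    job_dir = Path(checkpoint_root) / job_id
    if not job_dir.is_dir():
        return []
    names = [
        p.name
        for p in job_dir.iterdir()
        if p.is_dir() and p.name.startswith("chk-") and p.name[4:].isdigit()
    ]
    return sorted(names, key=lambda n: int(n[4:]))
```

For the situation in the second screenshot, calling {{retained_checkpoints}} with the previous job's ID would be expected to list chk-32 and chk-33 if they are still on disk.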