[jira] [Created] (FLINK-35178) Checkpoint CLAIM mode does not fully control snapshot ownership

elon_X (Jira) Fri, 19 Apr 2024 23:57:08 -0700

elon_X created FLINK-35178:
------------------------------

             Summary: Checkpoint CLAIM mode does not fully control snapshot 
ownership
                 Key: FLINK-35178
                 URL: https://issues.apache.org/jira/browse/FLINK-35178
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
            Reporter: elon_X
         Attachments: image-2024-04-20-14-51-21-062.png


When I enable incremental checkpointing and the task fails or is canceled for 
some reason, restarting the task from {{-s checkpoint_path}} with {{restoreMode 
CLAIM}} allows the Flink job to recover from the last checkpoint, but it only 
discards the previous checkpoint.

Then I found that this leads to the following two cases:

1. If the new checkpoint_x meta file does not reference files in the shared 
directory under the previous jobID:

the shared and taskowned directories from the previous job are left behind as 
empty directories, and Flink never deletes them. 
!image-2024-04-20-14-51-21-062.png!

2. If the new checkpoint_x meta file references files in the shared directory 
under the previous jobID:

the chk-(x-1) from the previous job will be discarded, but state data will 
remain in the shared directory under that job, possibly for a relatively long 
time. This raises a question: the previous job is no longer running, so it is 
unclear whether users may delete that state data. Deleting it could lead to 
errors when the task is restarted, because the meta file might reference files 
that can no longer be found. This could be confusing for users.
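
To make the two cases easier to spot on disk, here is a minimal sketch that 
scans a checkpoint root and labels each job directory. It assumes a local 
filesystem and the documented layout <checkpoint-root>/<jobID>/ containing 
chk-N, shared and taskowned. The function names and the three labels are 
hypothetical and not part of Flink. The script only inspects directories; it 
cannot tell whether a newer job's meta file still references a file, so the 
"residual-state" label is not a signal that deletion is safe.

```python
from pathlib import Path


def classify_job_dir(job_dir: Path) -> str:
    """Label one <checkpoint-root>/<jobID> directory (labels are hypothetical).

    "active"         : at least one chk-* directory is still present
    "empty-leftover" : no chk-* directory, shared/taskowned hold no files (case 1)
    "residual-state" : no chk-* directory, shared/taskowned still hold files (case 2)
    """
    has_checkpoint = any(
        p.is_dir() and p.name.startswith("chk-") for p in job_dir.iterdir()
    )
    if has_checkpoint:
        return "active"

    has_state_files = False
    for name in ("shared", "taskowned"):
        sub_dir = job_dir / name
        # Skip directories that are missing altogether.
        if sub_dir.is_dir() and any(f.is_file() for f in sub_dir.rglob("*")):
            has_state_files = True
            break

    return "residual-state" if has_state_files else "empty-leftover"


def scan_checkpoint_root(root: Path) -> dict:
    """Map each job directory name under root to its label."""
    return {
        d.name: classify_job_dir(d) for d in sorted(root.iterdir()) if d.is_dir()
    }
```

Calling scan_checkpoint_root(Path("/path/to/checkpoints")) (a placeholder path) 
returns a dict keyed by jobID. Entries labeled "empty-leftover" correspond to 
case 1, and entries labeled "residual-state" are candidates for case 2.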

 

A potential solution might be to reuse the previous job's jobID when 
restoring from {{-s checkpoint_path}}, or to add a new parameter that 
allows users to specify the jobID they want to recover from.

 

Please correct me if there's anything I've misunderstood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
