[jira] [Commented] (FLINK-35178) Checkpoint CLAIM mode does not fully control snapshot ownership

elon_X (Jira) Mon, 22 Apr 2024 00:20:53 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839522#comment-17839522
 ]


elon_X commented on FLINK-35178:
--------------------------------

[~lijinzhong] 

Thank you for your response.

I have been using the default value (true) for the 
"state.checkpoints.create-subdir" parameter. I also tested with the value set 
to false and got the same result, so I may be doing something wrong.

Additionally, I've encountered another issue. Even though I set 
{{state.checkpoints.num-retained=3}}, the previous job's older checkpoints are 
not discarded, even when nothing references them. Only the checkpoint 
specified by the {{-s}} option (chk-x) is discarded.

As shown in the diagram below, I restored from chk-34, but only chk-34 was 
discarded, while chk-32 and chk-33 continue to exist indefinitely.

!image-2024-04-22-15-16-02-381.png!
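
To list which checkpoint directories are left behind under the previous job, 
a small script along these lines could be used. It is only a sketch: it 
assumes the checkpoint directory is reachable as a local (or mounted) path 
laid out as <checkpoint-dir>/<job-id>/chk-<n>, and the function name 
leftover_checkpoints is made up for illustration. It does not call any Flink 
API.

```python
import re
from pathlib import Path

# Checkpoint directories are assumed to be named chk-<n>.
CHK_PATTERN = re.compile(r"^chk-(\d+)$")


def leftover_checkpoints(job_dir):
    """Return the sorted checkpoint IDs of chk-<n> directories under job_dir.

    job_dir is the previous job's directory, i.e. <checkpoint-dir>/<job-id>.
    Entries that are not directories, or not named chk-<n>, are ignored.
    """
    ids = []
    for entry in Path(job_dir).iterdir():
        match = CHK_PATTERN.match(entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)
```

Calling leftover_checkpoints on the previous job's directory after the new 
job has completed a few checkpoints shows which chk-<n> directories still 
exist there.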

> Checkpoint CLAIM mode does not fully control snapshot ownership
> ---------------------------------------------------------------
>
>                 Key: FLINK-35178
>                 URL: https://issues.apache.org/jira/browse/FLINK-35178
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: elon_X
>            Priority: Major
>         Attachments: image-2024-04-20-14-51-21-062.png, 
> image-2024-04-22-15-16-02-381.png
>
>
> When incremental checkpointing is enabled and the task fails or is canceled 
> for some reason, restarting the task from {{-s checkpoint_path}} with 
> {{restoreMode CLAIM}} lets the Flink job recover from the last checkpoint, 
> but Flink only discards that one checkpoint.
> Then I found that this leads to the following two cases:
> 1. If the new checkpoint_x meta file does not reference files in the shared 
> directory under the previous jobID:
> the shared and taskowned directories of the previous job are left behind as 
> empty directories, and Flink never deletes them.
> !image-2024-04-20-14-51-21-062.png!
> 2. If the new checkpoint_x meta file references files in the shared directory 
> under the previous jobID:
> the chk-(x-1) from the previous job will be discarded, but state data will 
> remain in the shared directory under that job, possibly for a relatively 
> long time. This raises a question: the previous job is no longer running, 
> and it is unclear whether users should delete its state data. Deleting it 
> could cause errors when the task is restarted, because the meta file might 
> reference files that can no longer be found. This could be confusing for 
> users.
>  
> A potential solution might be to reuse the previous job's jobID when 
> restoring from {{-s checkpoint_path}}, or to add a new parameter that 
> allows users to specify the jobID they want to recover from.
>  
> Please correct me if there's anything I've misunderstood.
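
The two cases in the description above can be told apart with a rough check 
like the one below. It is a sketch under assumptions, not a Flink tool: it 
assumes the previous job's directory is reachable as a local (or mounted) 
path containing shared and taskowned subdirectories, and the function name 
classify_old_job_dir and its return labels are made up for illustration.

```python
from pathlib import Path


def classify_old_job_dir(old_job_dir):
    """Classify what remains under a previous job's checkpoint directory.

    Returns:
      "missing"    - neither shared/ nor taskowned/ exists
      "empty-dirs" - the directories exist but contain no files (case 1)
      "has-state"  - at least one of them still contains files (case 2)
    """
    root = Path(old_job_dir)
    existing = [d for d in (root / "shared", root / "taskowned") if d.is_dir()]
    if not existing:
        return "missing"
    for directory in existing:
        # rglob walks nested directories too; only regular files count as state.
        if any(path.is_file() for path in directory.rglob("*")):
            return "has-state"
    return "empty-dirs"
```

A result of "has-state" only says that files are still present. It does not 
say whether the new job's checkpoints still reference them, so it is not a 
safe basis for deleting anything.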



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
