[jira] [Commented] (FLINK-21351) Incremental checkpoint data would be lost once a non-stop savepoint completed

Roman Khachatryan (Jira) Wed, 17 Feb 2021 10:07:20 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286040#comment-17286040
 ]


Roman Khachatryan commented on FLINK-21351:
-------------------------------------------

I think that would work, but it would unnecessarily prevent savepoints from being subsumed.

A slightly different approach would allow savepoints to be subsumed too:
1. Iterate through the completed checkpoints, starting from the earliest
2. Subsume a checkpoint if it is earlier than the last checkpoint that is not a savepoint
3. Subsume a savepoint if it is not the last one
4. Break as soon as checkpoints.size <= maxRetain
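
To make the steps above concrete, here is a minimal, self-contained sketch. It is not the code from the PR: the class SubsumeSketch, the Entry type, and the subsume method are hypothetical names used only for illustration. It also assumes that "the last one" in step 3 means the most recent entry in the store.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class SubsumeSketch {

    /** Hypothetical stand-in for a completed checkpoint or savepoint. */
    static final class Entry {
        final long id;
        final boolean savepoint;

        Entry(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }
    }

    /**
     * Removes subsumable entries, oldest first, until at most maxRetain remain
     * or nothing else may be removed. Returns the ids that were subsumed.
     */
    static List<Long> subsume(Deque<Entry> checkpoints, int maxRetain) {
        List<Long> subsumed = new ArrayList<>();
        if (checkpoints.size() <= maxRetain) {
            return subsumed;
        }

        // Assumption: "the last one" in step 3 is the newest entry in the store.
        Entry latest = checkpoints.peekLast();

        // Find the last entry that is a regular checkpoint, not a savepoint.
        Entry lastCheckpoint = null;
        for (Entry e : checkpoints) {
            if (!e.savepoint) {
                lastCheckpoint = e;
            }
        }

        // Step 1: iterate from the earliest entry.
        Iterator<Entry> it = checkpoints.iterator();
        // Step 4: stop as soon as the store fits into maxRetain.
        while (checkpoints.size() > maxRetain && it.hasNext()) {
            Entry e = it.next();
            boolean canSubsume = e.savepoint
                    ? e != latest                                          // step 3
                    : lastCheckpoint != null && e.id < lastCheckpoint.id;  // step 2
            if (canSubsume) {
                it.remove();
                subsumed.add(e.id);
            }
        }
        return subsumed;
    }

    public static void main(String[] args) {
        Deque<Entry> store = new ArrayDeque<>();
        store.addLast(new Entry(1, false)); // incremental checkpoint
        store.addLast(new Entry(2, true));  // non-stop savepoint
        System.out.println(subsume(store, 1)); // prints []

        store.addLast(new Entry(3, false)); // next incremental checkpoint
        System.out.println(subsume(store, 1)); // prints [1, 2]
    }
}
```

In this sketch, with maxRetain = 1, a savepoint that completes after checkpoint 1 does not subsume it, so the store temporarily holds two entries. Once checkpoint 3 completes, both checkpoint 1 and the savepoint are subsumed.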

I've published a PR with this change. Could you take a look?
https://github.com/apache/flink/pull/14953

> Incremental checkpoint data would be lost once a non-stop savepoint completed
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-21351
>                 URL: https://issues.apache.org/jira/browse/FLINK-21351
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.3, 1.12.1, 1.13.0
>            Reporter: Yun Tang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.4, 1.13.0, 1.12.3
>
>
> FLINK-10354 counted savepoints as retained checkpoints so that a job could 
> fail over from the latest position. I think this is reasonable; however, the 
> current implementation loses incremental checkpoint data immediately 
> once a non-stop savepoint completes.
> The current general flow for incremental checkpoints: once a newer checkpoint 
> completes, it is added to the checkpoint store. If the number of 
> completed checkpoints exceeds the max retained limit, the store subsumes the 
> oldest one. This decrements the reference count of the incremental data by 
> one, and the data is deleted once the count reaches zero. As we always 
> register the newer checkpoint before unregistering the older one, this flow 
> works as expected.
> However, if a non-stop savepoint (an intermediate, manually triggered 
> savepoint) completes, it is also added to the checkpoint store and simply 
> subsumes the previously added checkpoint (in the default case of retaining 
> one checkpoint). This unregisters the older checkpoint without a newer 
> checkpoint having been registered, leading to data loss.
> Thanks to [~banmoy] for reporting this problem first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
