[ https://issues.apache.org/jira/browse/FLINK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286040#comment-17286040 ]
Roman Khachatryan commented on FLINK-21351:
-------------------------------------------

I think that would work, but it would unnecessarily keep savepoints from
being subsumed.
A slightly different approach would allow savepoints to be subsumed too:
1. Iterate through the completed checkpoints, starting from the earliest.
2. Subsume a checkpoint if it is earlier than the last checkpoint that is
   not a savepoint.
3. Subsume a savepoint if it is not the last one.
4. Break as soon as checkpoints.size <= maxRetain.

I've published a PR with this change; could you take a look?
https://github.com/apache/flink/pull/14953

> Incremental checkpoint data would be lost once a non-stop savepoint completed
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-21351
>                 URL: https://issues.apache.org/jira/browse/FLINK-21351
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.3, 1.12.1, 1.13.0
>            Reporter: Yun Tang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.4, 1.13.0, 1.12.3
>
>
> FLINK-10354 counted savepoints as retained checkpoints so that a job could
> fail over from the latest position. I think this is reasonable; however, the
> current implementation loses incremental checkpoint data immediately once a
> non-stop savepoint completes.
> The current general flow for incremental checkpoints: once a newer checkpoint
> completes, it is added to the checkpoint store. If the number of completed
> checkpoints then exceeds the max retained limit, the oldest one is subsumed.
> This decrements the reference count of its incremental data by one, and the
> data is deleted once the reference count reaches zero. Because we always
> register the newer checkpoint before unregistering the older one, the current
> flow works as expected.
> However, if a non-stop savepoint (an intermediate, manually triggered
> savepoint) completes, it is also added to the checkpoint store and simply
> subsumes the previously added checkpoint (in the default case of retaining
> one checkpoint). This unregisters the older checkpoint without a newer
> checkpoint having been registered, leading to data loss.
> Thanks to [~banmoy] for reporting this problem first.
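
For readers following the thread, the four subsumption steps proposed in the comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the PR: `SubsumeSketch`, `Entry`, and `subsume` are hypothetical names, `Entry` is a stand-in rather than Flink's own checkpoint class, and "the last one" in step 3 is read here as the most recent entry in the store.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class SubsumeSketch {

    /** Hypothetical stand-in for a completed checkpoint or savepoint. */
    static final class Entry {
        final long id;
        final boolean savepoint;

        Entry(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }
    }

    /**
     * Removes subsumable entries from the store (ordered oldest to newest)
     * and returns the ids of the entries that were subsumed.
     */
    static List<Long> subsume(Deque<Entry> store, int maxRetain) {
        List<Long> subsumed = new ArrayList<>();
        if (store.size() <= maxRetain) {
            return subsumed;
        }
        // Assumption: "the last one" means the newest entry in the store.
        Entry latest = store.peekLast();
        // Find the newest entry that is a regular checkpoint, not a savepoint.
        Entry latestCheckpoint = null;
        for (Entry e : store) {
            if (!e.savepoint) {
                latestCheckpoint = e;
            }
        }
        // Step 1: iterate from the earliest; step 4: stop once within the limit.
        Iterator<Entry> it = store.iterator();
        while (store.size() > maxRetain && it.hasNext()) {
            Entry e = it.next();
            boolean canSubsume = e.savepoint
                    ? e != latest             // step 3
                    : e != latestCheckpoint;  // step 2
            if (canSubsume) {
                it.remove();
                subsumed.add(e.id);
            }
        }
        return subsumed;
    }

    public static void main(String[] args) {
        // The scenario from the issue: one checkpoint, then a savepoint, maxRetain = 1.
        Deque<Entry> store = new ArrayDeque<>();
        store.add(new Entry(1, false));
        store.add(new Entry(2, true));
        System.out.println(subsume(store, 1)); // prints "[]"
        System.out.println(store.size());      // prints "2"
    }
}
```

In the scenario from the issue description, the checkpoint is kept even though the store temporarily holds more than maxRetain entries, so its incremental data is not unregistered before a newer checkpoint has been registered.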