Steven Zhen Wu created FLINK-27101:
--------------------------------------

             Summary: Periodically break the chain of incremental checkpoint
                 Key: FLINK-27101
                 URL: https://issues.apache.org/jira/browse/FLINK-27101
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
            Reporter: Steven Zhen Wu


Incremental checkpoint is almost a must for large-state jobs. It greatly 
reduces the bytes uploaded to DFS per checkpoint. However, there are  a few 
implications from incremental checkpoint that are problematic for production 
operations.  Will use S3 as an example DFS in the rest of description.

1. Because there is no way to deterministically know how far back the 
incremental checkpoint can refer to files uploaded to S3, it is very difficult 
to set S3 bucket/object TTL. In one application, we have observed Flink 
checkpoint referring to files uploaded over 6 months ago. S3 TTL can corrupt 
the Flink checkpoints.

S3 TTL is important for a few reasons
- purge orphaned files (like external checkpoints from previous deployments) to 
keep the storage cost in check. This problem can be addressed by implementing 
proper garbage collection (similar to JVM) by traversing the retained 
checkpoints from all jobs and traverse the file references. But that is an 
expensive solution from engineering cost perspective.
- Security and privacy. E.g., there may be requirement that Flink state can't 
keep the data for more than some duration threshold (hours/days/weeks). 
Application is expected to purge keys to satisfy the requirement. However, with 
incremental checkpoint and how deletion works in RocksDB, it is hard to set S3 
TTL to purge S3 files. Even though those old S3 files don't contain live keys, 
they may still be referrenced by retained Flink checkpoints.

2. Occasionally, corrupted checkpoint files (on S3) are observed. As a result, 
restoring from checkpoint failed. With incremental checkpoint, it usually 
doesn't help to try other older checkpoints, because they may refer to the same 
corrupted file. It is unclear whether the corruption happened before or during 
S3 upload. This risk can be mitigated with periodical savepoints.

It all boils down to periodical full snapshot (checkpoint or savepoint) to 
deterministically break the chain of incremental checkpoints. Search the jira 
history, the behavior that FLINK-23949 [1] trying to fix is actually close to 
what we would need here.

There are a few options

1. Periodically trigger savepoints (via control plane). This is actually not a 
bad practice and might be appealing to some people. The problem is that it 
requires a job deployment to break the chain of incremental checkpoint. 
periodical job deployment may sound hacky. If we make the behavior of full 
checkpoint after a savepoint (fixed in FLINK-23949) configurable, it might be 
an acceptable compromise. The benefit is that no job deployment is required 
after savepoints.

2. Build the feature in Flink incremental checkpoint. Periodically (with some 
cron style config) trigger a full checkpoint to break the incremental chain. If 
the full checkpoint failed (due to whatever reason), the following checkpoints 
should attempt full checkpoint as well until one successful full checkpoint is 
completed.

3. For the security/privacy requirement, the main thing is to apply compaction 
on the deleted keys. That could probably avoid references to the old files. Is 
there any RocksDB compation can achieve full compaction of removing old delete 
markers. Recent delete markers are fine


[1] https://issues.apache.org/jira/browse/FLINK-23949



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to