[ https://issues.apache.org/jira/browse/FLINK-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857564#comment-16857564 ]
Aljoscha Krettek commented on FLINK-6755:
-----------------------------------------

Hi,

I had a brief discussion with Stephan that helped me sort my thoughts on the broader topics of checkpoints, savepoints, binary formats, user-triggered checkpoints, and periodic savepoints. I’ll try to summarise my stance on this and also comment with the same message on the other relevant Jira issues and threads.

For reference, the relevant FLIP and Jira issues are these:

 - https://cwiki.apache.org/confluence/display/FLINK/FLIP-41%3A+Unify+Keyed+State+Snapshot+Binary+Format+for+Savepoints: Unified Savepoint Format
 - FLINK-12619: Add support for stop-with-checkpoint
 - FLINK-6755: User-triggered checkpoints
 - FLINK-4620: Automatically creating savepoints
 - FLINK-4511: Schedule periodic savepoints

There are roughly two different dimensions in the topic of savepoints/checkpoints (I’ll use snapshot as the generic term for both):

 1) who controls the snapshot
 2) what’s the (binary) format of the snapshot

For 1), we currently have checkpoints and savepoints. Checkpoints are created by the system for fault tolerance. They are managed by the system and the system is free to discard them when it sees fit. Savepoints are in the control of the user. A user can choose to create a savepoint, and they can delete it or restore from it at will. The system will not clean up savepoints. We should try and keep this separation and not muddle the two concepts.

For 2), we currently have various different formats between the different state backends and also for the same backend. For example, RocksDB can do full or incremental snapshots, local snapshots, and probably more. FLIP-41 aims at introducing a unified “savepoint” format that is interchangeable between the different state backends. In light of the above points, we should say that FLIP-41 aims to introduce a canonical format that is interchangeable between different backends.
This doesn’t mean that we should tie this format strictly to savepoints, though. For performance reasons, users might choose to do savepoints that use one of the optimised formats that the backends offer, for example incremental snapshots. Or they might choose to use the canonical format for regular checkpoints so that they can always switch between backends using periodically created externalised checkpoints.

The motivation behind FLINK-12619 is to have a more lightweight alternative for stop-with-savepoint, for example using the incremental snapshot format that RocksDB has. With the above in mind, however, this becomes “Add support for choosing the snapshot format for stop-with-savepoint”. It should not be stop-with-checkpoint, because checkpoints are something that the system manages and not something that the user should trigger.

The same is true for FLINK-6755; I think the motivation is the same. The change should, however, be called “Add support for choosing the snapshot format for savepoints”.

For the last two Jira issues mentioned above it should be quite clear what I think. I do, however, see a need for potentially different overlapping checkpoint periods or intervals. Users might want their regular checkpoints to use an optimised format but also want a “canonical format” checkpoint every now and then so that the lineage of incremental checkpoints does not become too unwieldy.

Please let me know what you think!

Aljoscha

> Allow triggering Checkpoints through command line client
> --------------------------------------------------------
>
>                 Key: FLINK-6755
>                 URL: https://issues.apache.org/jira/browse/FLINK-6755
>             Project: Flink
>          Issue Type: New Feature
>          Components: Command Line Client, Runtime / Checkpointing
>    Affects Versions: 1.3.0
>            Reporter: Gyula Fora
>            Assignee: vinoyang
>            Priority: Major
>
> The command line client currently only allows triggering (and canceling with)
> Savepoints.
> While this is good if we want to fork or modify the pipelines in a
> non-checkpoint compatible way, now with incremental checkpoints this becomes
> wasteful for simple job restarts/pipeline updates.
> I suggest we add a new command:
> ./bin/flink checkpoint <jobID> [checkpointDirectory]
> and a new flag -c for the cancel command to indicate we want to trigger a
> checkpoint:
> ./bin/flink cancel -c [targetDirectory] <jobID>
> Otherwise this can work similarly to the current savepoint taking logic; we
> could probably even piggyback on the current messages by adding a boolean flag
> indicating whether it should be a savepoint or a checkpoint.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)