[ 
https://issues.apache.org/jira/browse/FLINK-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857564#comment-16857564
 ] 

Aljoscha Krettek commented on FLINK-6755:
-----------------------------------------

Hi,

I had a brief discussion with Stephan that helped me sort my thoughts on the 
broader topics of checkpoints, savepoints, binary formats, user-triggered 
checkpoints, and periodic savepoints. I’ll try to summarise my stance on this 
and also comment with the same message on the other relevant Jira Issues and 
threads.

For reference, the relevant FLIP and Jira issues are these:

- FLIP-41: Unified Savepoint Format
  https://cwiki.apache.org/confluence/display/FLINK/FLIP-41%3A+Unify+Keyed+State+Snapshot+Binary+Format+for+Savepoints
- FLINK-12619: Add support for stop-with-checkpoint
- FLINK-6755: User-triggered checkpoints
- FLINK-4620: Automatically creating savepoints
- FLINK-4511: Schedule periodic savepoints

There are roughly two different dimensions in the topic of 
savepoints/checkpoints (I’ll use snapshot as the generic term for both):
 1) who controls the snapshot
 2) what’s the (binary) format of the snapshot

For 1), we currently have checkpoints and savepoints. Checkpoints are created 
by the system for fault tolerance. They are managed by the system and the 
system is free to discard them when it sees fit. Savepoints are in the control 
of the user. A user can create a savepoint, delete it, and restore from it at 
will. The system will not clean up savepoints. We 
should try and keep this separation and not muddle the two concepts.

For 2), we currently have various different formats between the different state 
backends and also within the same backend. For example, RocksDB can do full or 
incremental snapshots, local snapshots, and probably more.

FLIP-41 aims at introducing a unified “savepoint” format that is 
interchangeable between the different state backends. In light of the above 
points, we should say that FLIP-41 aims to introduce a canonical format that is 
interchangeable between different backends. This doesn’t mean that we should 
tie this format strictly to savepoints, though. For performance reasons, users 
might choose to do savepoints that use one of the optimised formats that the 
backends offer, for example incremental snapshots. Or they might choose to use 
the canonical format for regular checkpoints so that they can always switch 
between backends using periodically created externalised checkpoints.
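
To illustrate how the two dimensions are independent of each other, here is a 
minimal sketch in Python. All names in it (Owner, Format, Snapshot and the two 
methods) are made up for this illustration and are not Flink classes or APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    """Dimension 1: who controls the snapshot (hypothetical names)."""
    SYSTEM = "system"  # a checkpoint: created and cleaned up by the system
    USER = "user"      # a savepoint: created and deleted by the user


class Format(Enum):
    """Dimension 2: the binary format of the snapshot (hypothetical names)."""
    CANONICAL = "canonical"            # interchangeable between backends
    BACKEND_NATIVE = "backend-native"  # e.g. an incremental RocksDB snapshot


@dataclass(frozen=True)
class Snapshot:
    owner: Owner
    fmt: Format

    def system_may_discard(self) -> bool:
        # Only system-owned snapshots may be cleaned up by the system.
        return self.owner is Owner.SYSTEM

    def portable_across_backends(self) -> bool:
        # Only the canonical format can be restored by a different backend.
        return self.fmt is Format.CANONICAL


# A savepoint that uses an optimised backend format:
fast_savepoint = Snapshot(Owner.USER, Format.BACKEND_NATIVE)
# A regular checkpoint written in the canonical format:
portable_checkpoint = Snapshot(Owner.SYSTEM, Format.CANONICAL)
```

In this sketch, choosing the format never changes who owns the snapshot, which 
is the separation argued for above.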

The motivation behind FLINK-12619 is to have a more lightweight alternative for 
stop-with-savepoint, for example using the incremental snapshot format that 
RocksDB has. With the above in mind, however, this becomes “Add support for 
choosing the snapshot format for stop-with-savepoint”. It should not be 
stop-with-checkpoint, because checkpoints are something that the system manages 
and not something that the user should trigger. The same is true for 
FLINK-6755, which I think has the same motivation. That change should instead 
be called “Add support for choosing the snapshot format for savepoints”.

For the last two Jira issues mentioned above it should be quite clear what I 
think. I do, however, see a need for potentially different overlapping 
checkpoint periods or intervals. Users might want to have their regular 
checkpoints use an optimised format but they also want to have a “canonical 
format” checkpoint every now and then so that the lineage of incremental 
checkpoints does not become too unwieldy.
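
One possible way to think about such overlapping intervals is sketched below 
in Python. The function name and the canonical_every parameter are 
hypothetical and only serve the illustration; this is not an existing Flink 
configuration option.

```python
def format_for_checkpoint(checkpoint_id: int, canonical_every: int) -> str:
    """Pick the snapshot format for a periodic checkpoint.

    Every `canonical_every`-th checkpoint uses the canonical format, all
    others use the optimised (incremental) format. Both parameter names
    are assumptions made for this sketch.
    """
    if canonical_every < 1:
        raise ValueError("canonical_every must be at least 1")
    if checkpoint_id % canonical_every == 0:
        return "canonical"
    return "incremental"


# With canonical_every=3, checkpoints 3 and 6 are canonical, so a chain of
# incremental checkpoints never grows beyond two in this sketch.
formats = [format_for_checkpoint(i, canonical_every=3) for i in range(1, 7)]
```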

Please let me know what you think!

Aljoscha

> Allow triggering Checkpoints through command line client
> --------------------------------------------------------
>
>                 Key: FLINK-6755
>                 URL: https://issues.apache.org/jira/browse/FLINK-6755
>             Project: Flink
>          Issue Type: New Feature
>          Components: Command Line Client, Runtime / Checkpointing
>    Affects Versions: 1.3.0
>            Reporter: Gyula Fora
>            Assignee: vinoyang
>            Priority: Major
>
> The command line client currently only allows triggering (and canceling with) 
> Savepoints. 
> While this is good if we want to fork or modify the pipelines in a 
> non-checkpoint compatible way, now with incremental checkpoints this becomes 
> wasteful for simple job restarts/pipeline updates. 
> I suggest we add a new command: 
> ./bin/flink checkpoint <jobID> [checkpointDirectory]
> and a new flag -c for the cancel command to indicate we want to trigger a 
> checkpoint:
> ./bin/flink cancel -c [targetDirectory] <jobID>
> Otherwise this can work similar to the current savepoint taking logic, we 
> could probably even piggyback on the current messages by adding boolean flag 
> indicating whether it should be a savepoint or a checkpoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
