[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Aljoscha Krettek (JIRA) Thu, 06 Jun 2019 07:30:12 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857747#comment-16857747
 ]


Aljoscha Krettek commented on FLINK-12619:
------------------------------------------

The documentation states what I also described: savepoints are user-controlled 
and checkpoints are system-controlled. That's why I think efforts such as 
allowing user-triggered checkpoints (which includes a user-triggered 
stop-with-checkpoint) break that distinction between user control and system 
control. That being said, there are clear use cases where you want a more 
lightweight savepoint, and for those we should allow taking a savepoint in a 
more efficient format (i.e. the format that checkpoints normally use).

The "canonical format" that (in my opinion) FLIP-41 will introduce can be used 
to create savepoints that are compatible between backends. What I'm saying, 
however, is that we should not strictly tie this to only savepoints.

With the changes of FLIP-41, the default config for taking a savepoint might be 
"use the canonical but slow format" and the default for checkpoints might be 
"use the optimized (incremental) format". But users can choose to do a 
stop-with-savepoint using the optimized incremental format because they know 
that they don't want to switch to a different state backend and would like the 
speed and size benefit of the faster format.
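
The defaults described above could be sketched roughly as follows. This is only 
an illustration under stated assumptions: every name here (SnapshotFormatSketch, 
SnapshotKind, SnapshotFormat, resolveFormat, restorableOnDifferentBackend) is 
hypothetical and not an existing Flink API or config option.

```java
import java.util.Optional;

// Hypothetical sketch only: none of these types exist in Flink.
public class SnapshotFormatSketch {

    /** Who controls the snapshot: the user (savepoint) or the system (checkpoint). */
    enum SnapshotKind { SAVEPOINT, CHECKPOINT }

    /** The two formats discussed: backend-independent but slow, or backend-specific but fast. */
    enum SnapshotFormat { CANONICAL, OPTIMIZED_INCREMENTAL }

    /** Assumed defaults: savepoints use the canonical format, checkpoints the optimized one. */
    static SnapshotFormat defaultFormat(SnapshotKind kind) {
        return kind == SnapshotKind.SAVEPOINT
                ? SnapshotFormat.CANONICAL
                : SnapshotFormat.OPTIMIZED_INCREMENTAL;
    }

    /** An explicit user choice overrides the default, so the format is not tied to the kind. */
    static SnapshotFormat resolveFormat(SnapshotKind kind, Optional<SnapshotFormat> userChoice) {
        return userChoice.orElse(defaultFormat(kind));
    }

    /** Only the canonical format is assumed to be compatible between state backends. */
    static boolean restorableOnDifferentBackend(SnapshotFormat format) {
        return format == SnapshotFormat.CANONICAL;
    }

    public static void main(String[] args) {
        // A user who will not switch backends opts into the faster format for a savepoint.
        SnapshotFormat chosen = resolveFormat(
                SnapshotKind.SAVEPOINT, Optional.of(SnapshotFormat.OPTIMIZED_INCREMENTAL));
        System.out.println("format: " + chosen);
        System.out.println("backend switch possible: " + restorableOnDifferentBackend(chosen));
    }
}
```

The point of the sketch is that the snapshot kind only selects a default; the 
trade-off (speed and size versus the ability to switch state backends) follows 
from the chosen format, not from whether it is called a savepoint or a checkpoint.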

> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
>                 Key: FLINK-12619
>                 URL: https://issues.apache.org/jira/browse/FLINK-12619
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating/suspending 
> a job with a checkpoint. This improvement works together with the incremental and 
> external checkpoint features: if checkpoints are retained and this feature 
> is configured, we will trigger a checkpoint before the job stops. It could 
> accelerate job recovery a lot since:
> 1. No source rewinding required any more.
> 2. It's much faster than taking a savepoint since incremental checkpoint is 
> enabled.
> Please note that conceptually savepoints are different from checkpoints in a 
> similar way that backups are different from recovery logs in traditional 
> database systems. So we suggest using this feature only for job recovery, 
> while sticking with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
