[ 
https://issues.apache.org/jira/browse/FLINK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu updated FLINK-20872:
------------------------
    Affects Version/s:     (was: 1.11.0)
                       1.10.0

> Job resume from history savepoint when failover if checkpoint is disabled
> -------------------------------------------------------------------------
>
>                 Key: FLINK-20872
>                 URL: https://issues.apache.org/jira/browse/FLINK-20872
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: Liu
>            Priority: Minor
>
> I have a long-running job with checkpointing disabled and a restartStrategy 
> configured. At one point I upgraded the job through a savepoint. A day later, 
> the job failed and restarted automatically, but it resumed from that earlier 
> savepoint, so the job lagged heavily.
>  
> I checked the code and found that the job first tries to resume from a 
> checkpoint and, if none is found, from the savepoint.
> {code:java}
> if (checkpointCoordinator != null) {
>     // check whether we find a valid checkpoint
>     if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
>             new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
>         // check whether we can restore from a savepoint
>         tryRestoreExecutionGraphFromSavepoint(
>                 newExecutionGraph, jobGraph.getSavepointRestoreSettings());
>     }
> }
> {code}
> For a job with checkpointing disabled, internal failover should not resume 
> from the previous savepoint, especially when that savepoint was taken long 
> ago. In this situation, state loss is acceptable but lag is not.
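
The requested behavior can be sketched as a small decision function. This is only an illustration of the proposal, not Flink code: the class RestoreDecisionSketch, the enum RestoreSource, and the method chooseRestoreSource with its four boolean parameters are all hypothetical names, and the sketch assumes the scheduler can tell an initial submission apart from an internal failover.

```java
public class RestoreDecisionSketch {

    /** Where a (re)started job would take its initial state from. */
    enum RestoreSource { CHECKPOINT, SAVEPOINT, NONE }

    /**
     * Hypothetical helper, not part of Flink. It keeps the existing order
     * (checkpoint first, then savepoint) but skips the savepoint when the job
     * is recovering from a failure and checkpointing is disabled.
     */
    static RestoreSource chooseRestoreSource(
            boolean checkpointingEnabled,
            boolean checkpointAvailable,
            boolean savepointConfigured,
            boolean initialSubmission) {
        // A completed checkpoint always wins when checkpointing is enabled.
        if (checkpointingEnabled && checkpointAvailable) {
            return RestoreSource.CHECKPOINT;
        }
        // The savepoint is used on the initial submission, or as a fallback
        // when checkpointing is enabled but no checkpoint has completed yet.
        if (savepointConfigured && (initialSubmission || checkpointingEnabled)) {
            return RestoreSource.SAVEPOINT;
        }
        // Failover with checkpointing disabled: start without restored state
        // rather than replaying from an old savepoint.
        return RestoreSource.NONE;
    }

    public static void main(String[] args) {
        // The scenario from this issue: checkpointing disabled, savepoint
        // configured, job restarted by the restart strategy after a failure.
        System.out.println(chooseRestoreSource(false, false, true, false));
    }
}
```

With this rule, the upgrade itself still starts from the savepoint, and only later failovers of a job without checkpointing start with empty state.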



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
