Liu created FLINK-20872:
---------------------------
Summary: Job resume from history savepoint when failover if
checkpoint is disabled
Key: FLINK-20872
URL: https://issues.apache.org/jira/browse/FLINK-20872
Project: Flink
Issue Type: Improvement
Affects Versions: 1.12.0, 1.11.0
Reporter: Liu
I have a long-running job with checkpointing disabled and a restartStrategy
configured. At one point I upgraded the job via a savepoint. A day later, the
job failed and restarted automatically, but it resumed from that earlier
savepoint, so the job is now heavily lagged.
I checked the code and found that, when restoring, the job first tries to
resume from a checkpoint and, if none is found, falls back to the savepoint:
{code:java}
if (checkpointCoordinator != null) {
    // check whether we find a valid checkpoint
    if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
            new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
        // check whether we can restore from a savepoint
        tryRestoreExecutionGraphFromSavepoint(
                newExecutionGraph, jobGraph.getSavepointRestoreSettings());
    }
}
{code}
For a job with checkpointing disabled, internal failover should not resume from
a previous savepoint, especially one taken long ago. In this situation, losing
state is acceptable, but the resulting lag is not.
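
To make the requested behavior concrete, here is a minimal, self-contained
sketch of one possible decision rule. It does not use any Flink classes: the
class RestoreDecisionSketch, the enum RestoreSource, the method
chooseRestoreSource and the isInitialSubmission flag are all hypothetical names
introduced only for illustration, and the savepoint path is a placeholder. The
idea is simply that the savepoint fallback would apply only on the first
submission, not on an internal failover.

```java
import java.util.Optional;

// Hypothetical model of the restore decision; not Flink code.
final class RestoreDecisionSketch {

    enum RestoreSource { CHECKPOINT, SAVEPOINT, NONE }

    /**
     * Picks where to restore state from.
     *
     * @param hasCompletedCheckpoint whether a completed checkpoint is available
     * @param savepointPath          savepoint given at submission time, if any
     * @param isInitialSubmission    assumed flag: true on first start, false on failover
     */
    static RestoreSource chooseRestoreSource(
            boolean hasCompletedCheckpoint,
            Optional<String> savepointPath,
            boolean isInitialSubmission) {
        // A completed checkpoint is always preferred, as in the snippet above.
        if (hasCompletedCheckpoint) {
            return RestoreSource.CHECKPOINT;
        }
        // Proposed change: fall back to the savepoint only on the first start.
        if (isInitialSubmission && savepointPath.isPresent()) {
            return RestoreSource.SAVEPOINT;
        }
        // On failover without checkpoints, start with empty state instead of
        // replaying from an old savepoint.
        return RestoreSource.NONE;
    }

    public static void main(String[] args) {
        Optional<String> savepoint = Optional.of("/hypothetical/savepoint-path");

        // Upgrade through a savepoint, checkpointing disabled: first start.
        System.out.println(chooseRestoreSource(false, savepoint, true));  // SAVEPOINT

        // One day later: internal failover of the same job.
        System.out.println(chooseRestoreSource(false, savepoint, false)); // NONE
    }
}
```

With this rule, the scenario described above would restart with empty state
after the failure rather than reprocessing a day of data. Whether an actual fix
should distinguish the two cases this way is open for discussion.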