[ https://issues.apache.org/jira/browse/FLINK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509923#comment-17509923 ]
Dawid Wysakowicz commented on FLINK-26683:
------------------------------------------

I think we should still restart the job if it fails after the savepoint has completed but before the completion notification reaches the tasks. Tasks might want to commit their results, and that is why the job is restarted. If you stop with savepoint with drain, the job should finish on its own shortly after the restart. The restart is required to leave the savepoint in a consistent state.

It is a bit more tricky with a stop-with-savepoint --no-drain. Still, I believe it is a sane assumption that a savepoint is valid only once two conditions are met:
1) the savepoint was created successfully
2) the job terminated
If the job has not terminated, we should recover and continue processing, which might require the user to trigger another savepoint.

It would be good to understand why you end up in a state where some tasks are finished and some are running; that should not happen.

Lastly, there is a bug that we do not restore to the synchronous savepoint: https://issues.apache.org/jira/browse/FLINK-26783

> Terminate the job anyway if savepoint finished when stop-with-savepoint
> -----------------------------------------------------------------------
>
>                 Key: FLINK-26683
>                 URL: https://issues.apache.org/jira/browse/FLINK-26683
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Liu
>            Priority: Major
>
> When we stop with savepoint, the savepoint finishes. But some tasks failover
> for some reason and restart to running. In the end, some tasks are finished
> and some tasks are running. In this case, I think that we should terminate
> all the tasks anyway instead of restarting since the savepoint is finished
> and the job stops consuming data. What do you think?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)