[ 
https://issues.apache.org/jira/browse/FLINK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509923#comment-17509923
 ] 

Dawid Wysakowicz commented on FLINK-26683:
------------------------------------------

I think we should still restart the job if it fails after the savepoint has 
completed but before the completion notification reaches the tasks. Tasks 
might want to commit their results, and that is why the job is restarted. If 
you stop with savepoint with drain, the job should finish on its own shortly 
after the restart. The restart is required to leave the savepoint in a 
consistent state.

It is a bit more tricky with stop-with-savepoint --no-drain. Still, I believe 
it is a sane assumption that a savepoint is valid only once two conditions are 
met: 1) the savepoint was created successfully, 2) the job terminated. If the 
job has not terminated, we should recover and continue processing, which might 
require the user to trigger another savepoint.
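
The rule above can be written down as a small decision function. This is only 
a sketch to make the two conditions explicit: the class, enum, and method 
names are hypothetical and are not Flink APIs, and treating every other 
combination as "recover and continue" is an assumption of the sketch.

```java
// Hypothetical illustration only; none of these names exist in Flink.
final class StopWithSavepointSketch {

    /** Possible outcomes of a stop-with-savepoint attempt in this sketch. */
    enum Outcome {
        SAVEPOINT_VALID,      // both conditions hold, the savepoint can be used
        RESTART_AND_CONTINUE  // recover the job; the user may need a new savepoint
    }

    /**
     * A savepoint counts as valid only when it was created successfully
     * AND the job terminated. Any other combination leads to recovery
     * (assumption of this sketch).
     */
    static Outcome decide(boolean savepointCompleted, boolean jobTerminated) {
        if (savepointCompleted && jobTerminated) {
            return Outcome.SAVEPOINT_VALID;
        }
        return Outcome.RESTART_AND_CONTINUE;
    }

    public static void main(String[] args) {
        // Savepoint completed, but a task failed before the job terminated.
        System.out.println(decide(true, false));
        // Savepoint completed and the job terminated.
        System.out.println(decide(true, true));
    }
}
```

The case discussed in this ticket is the first call: the savepoint completed 
but the job did not terminate, so the job is restarted rather than stopped.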

It would be good to understand why you end up in a state where some tasks are 
finished and some are running. That should not happen.

Lastly, there is a bug where we do not restore to the synchronous savepoint: 
https://issues.apache.org/jira/browse/FLINK-26783


> Terminate the job anyway if savepoint finished when stop-with-savepoint
> -----------------------------------------------------------------------
>
>                 Key: FLINK-26683
>                 URL: https://issues.apache.org/jira/browse/FLINK-26683
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Liu
>            Priority: Major
>
> When we stop with savepoint, the savepoint finishes. But some tasks fail 
> over for some reason and restart into the running state. In the end, some 
> tasks are finished and some tasks are running. In this case, I think that we 
> should terminate all the tasks anyway instead of restarting, since the 
> savepoint is finished and the job stops consuming data. What do you think?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
