Chesnay Schepler created FLINK-27972:
----------------------------------------

             Summary: Race condition between task/savepoint notification failure
                 Key: FLINK-27972
                 URL: https://issues.apache.org/jira/browse/FLINK-27972
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.15.0
            Reporter: Chesnay Schepler


When a task throws an exception in notifyCheckpointComplete, we send two 
messages to the JobManager:
1) we inform the CheckpointCoordinator about the failed savepoint;
2) we inform the scheduler about the failed task.

Depending on the order in which these messages arrive, the adaptive scheduler 
behaves differently. If 1) arrives first, it properly informs the user about 
the created savepoint, which might contain uncommitted transactions; if 2) 
arrives first, it just restarts the job.

I'm not sure how big of an issue the latter case is.

In any case we might want to consider having the StopWithSavepoint state wait 
until the savepoint future has failed before doing anything else.
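
A rough sketch of that idea, to make the ordering problem concrete. This is 
not Flink's actual StopWithSavepoint implementation: the class 
StopWithSavepointSketch, its methods onSavepointFailure/onTaskFailure/run and 
the recorded action strings are all hypothetical, and the restart after the 
user has been informed is an assumption. The only point is that deferring the 
reaction to message 2) until the savepoint future has completed yields the 
same behavior for both arrival orders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the StopWithSavepoint state; not Flink's actual class.
final class StopWithSavepointSketch {
    // Completed exceptionally once message 1) (savepoint failure) has arrived.
    private final CompletableFuture<String> savepointFuture = new CompletableFuture<>();
    // Records the actions taken so that both arrival orders can be compared.
    final List<String> actions = new ArrayList<>();

    // Message 1): the CheckpointCoordinator reports the failed savepoint.
    void onSavepointFailure(Throwable cause) {
        savepointFuture.completeExceptionally(cause);
    }

    // Message 2): the scheduler learns about the failed task.
    void onTaskFailure(Throwable cause) {
        // Instead of restarting right away, wait for the savepoint outcome.
        // If the future is already complete, the callback runs immediately.
        savepointFuture.whenComplete((path, savepointError) -> {
            if (savepointError != null) {
                actions.add("inform user: " + savepointError.getMessage());
            }
            // Assumption: the job is restarted only after the user was informed.
            actions.add("restart job: " + cause.getMessage());
        });
    }

    // Delivers both messages in the requested order and returns the actions taken.
    static List<String> run(boolean savepointFailureFirst) {
        StopWithSavepointSketch state = new StopWithSavepointSketch();
        Throwable savepointError =
                new RuntimeException("savepoint may contain uncommitted transactions");
        Throwable taskError = new RuntimeException("notifyCheckpointComplete failed");
        if (savepointFailureFirst) {
            state.onSavepointFailure(savepointError);
            state.onTaskFailure(taskError);
        } else {
            state.onTaskFailure(taskError);
            state.onSavepointFailure(savepointError);
        }
        return state.actions;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // 1) arrives before 2)
        System.out.println(run(false)); // 2) arrives before 1)
    }
}
```

Both calls in main record the same two actions in the same order. The sketch 
is single-threaded on purpose; it does not model the case where the savepoint 
future never completes, which a real fix would have to handle.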



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
