Matthias Pohl created FLINK-38883:
-------------------------------------

             Summary: Race condition of REST API and JRS entry might lead to 
inconsistent state
                 Key: FLINK-38883
                 URL: https://issues.apache.org/jira/browse/FLINK-38883
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 2.1.1, 2.2.0, 1.20.3, 2.0.1
            Reporter: Matthias Pohl


We noticed an issue where the REST API reported a job as globally terminated 
({{FAILED}}) but the JobResultStore (JRS) entry was never created (due to 
object store problems). An external monitor treated the job as terminal based 
on the REST API response. However, because no JRS entry existed and the job 
data had not been cleaned up, the job was picked up again during recovery of 
the JobManager.

Conceptually, the problem stems from the fact that the JRS entry is only 
written after the job has reached the globally terminal state (which is 
reported via the REST API). Instead, the entry should be written before the 
job reaches that state, so that a terminal status is never visible without a 
matching JRS entry.
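
The proposed ordering can be illustrated with a minimal, self-contained sketch. It is not Flink code: {{JrsOrderingSketch}}, {{ResultStore}}, {{InMemoryStore}}, {{JobStateHolder}} and all method names are hypothetical stand-ins chosen for illustration, and the in-memory store only simulates an object store that can fail. The only point it demonstrates is that the terminal status becomes visible to readers (e.g. a REST handler) after the result entry was persisted, and stays non-terminal if the write fails.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, NOT Flink's actual classes or API.
final class JrsOrderingSketch {

    enum JobStatus { RUNNING, FAILED }

    /** Hypothetical stand-in for the JobResultStore (JRS). */
    interface ResultStore {
        void createEntry(String jobId, JobStatus finalStatus) throws IOException;
        boolean hasEntry(String jobId);
    }

    /** In-memory store whose writes can be made to fail, simulating object store problems. */
    static final class InMemoryStore implements ResultStore {
        private final Map<String, JobStatus> entries = new ConcurrentHashMap<>();
        private volatile boolean failWrites = false;

        void setFailWrites(boolean failWrites) { this.failWrites = failWrites; }

        @Override
        public void createEntry(String jobId, JobStatus finalStatus) throws IOException {
            if (failWrites) {
                throw new IOException("simulated object store failure");
            }
            entries.put(jobId, finalStatus);
        }

        @Override
        public boolean hasEntry(String jobId) { return entries.containsKey(jobId); }
    }

    /** Holds the status that a REST handler would report for one job. */
    static final class JobStateHolder {
        private final String jobId;
        private final ResultStore store;
        private volatile JobStatus visibleStatus = JobStatus.RUNNING;

        JobStateHolder(String jobId, ResultStore store) {
            this.jobId = jobId;
            this.store = store;
        }

        /** What an external monitor would see via the REST API. */
        JobStatus statusForRestApi() { return visibleStatus; }

        /**
         * Persists the result entry first and only then exposes the terminal status.
         * Returns false (status unchanged) if the entry could not be written.
         */
        boolean tryTransitionToTerminal(JobStatus terminal) {
            try {
                store.createEntry(jobId, terminal);
            } catch (IOException e) {
                return false; // no entry, so the terminal status must not become visible
            }
            visibleStatus = terminal;
            return true;
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        JobStateHolder job = new JobStateHolder("job-1", store);

        store.setFailWrites(true);
        System.out.println("write failed, transitioned: "
                + job.tryTransitionToTerminal(JobStatus.FAILED)
                + ", REST status: " + job.statusForRestApi());

        store.setFailWrites(false);
        System.out.println("write ok, transitioned: "
                + job.tryTransitionToTerminal(JobStatus.FAILED)
                + ", REST status: " + job.statusForRestApi());
    }
}
```

With this ordering, a monitor polling the hypothetical {{statusForRestApi()}} can only observe {{FAILED}} once the store holds an entry for the job. How a failed write should be retried or surfaced is left open here.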



