Matthias Pohl created FLINK-38883:
-------------------------------------
Summary: Race condition of REST API and JRS entry might lead to
inconsistent state
Key: FLINK-38883
URL: https://issues.apache.org/jira/browse/FLINK-38883
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 2.1.1, 2.2.0, 1.20.3, 2.0.1
Reporter: Matthias Pohl
We noticed an issue where the REST API reported a job as globally terminated
({{FAILED}}) but the JobResultStore (JRS) entry was never created (due to
object store problems). An external monitor treated the job as terminal based
on the REST API response. However, because no JRS entry existed and the job
data had not been cleaned up, the job was picked up again during JobManager
recovery and resumed running.
Conceptually, the problem stems from the fact that the JRS entry is only
written after the job has reached the globally terminal state, which is the
point at which the REST API starts reporting that state. Instead, the entry
should be written before the job reaches that state.
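
The proposed ordering can be illustrated with a minimal, self-contained sketch. It does not use Flink's actual classes: {{ResultStore}}, {{InMemoryResultStore}}, {{Job}}, {{Status}} and {{tryFinishGlobally}} are hypothetical names chosen for illustration only. The point it tries to show is that the externally visible status only flips to the terminal state after the store write has succeeded, so a failed write leaves the job reported as non-terminal.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class JrsOrderingSketch {

    /** Hypothetical stand-in for the JobResultStore; not Flink's real interface. */
    interface ResultStore {
        void createEntry(String jobId) throws IOException;
        boolean hasEntry(String jobId);
    }

    /** In-memory store that can simulate object store failures. */
    static class InMemoryResultStore implements ResultStore {
        private final Set<String> entries = new HashSet<>();
        private final boolean failWrites;

        InMemoryResultStore(boolean failWrites) {
            this.failWrites = failWrites;
        }

        @Override
        public void createEntry(String jobId) throws IOException {
            if (failWrites) {
                throw new IOException("simulated object store failure");
            }
            entries.add(jobId);
        }

        @Override
        public boolean hasEntry(String jobId) {
            return entries.contains(jobId);
        }
    }

    enum Status { RUNNING, FAILED }

    /** Hypothetical job whose visible status is what a REST endpoint would report. */
    static class Job {
        private final String id;
        private volatile Status visibleStatus = Status.RUNNING;

        Job(String id) {
            this.id = id;
        }

        Status getVisibleStatus() {
            return visibleStatus;
        }

        /**
         * Writes the store entry first and exposes the terminal status only if
         * the write succeeded. Returns false if the status was not changed.
         */
        boolean tryFinishGlobally(Status terminalStatus, ResultStore store) {
            try {
                store.createEntry(id);
            } catch (IOException e) {
                // Entry missing: keep reporting the non-terminal status, since
                // recovery would still pick the job up again.
                return false;
            }
            visibleStatus = terminalStatus;
            return true;
        }
    }

    public static void main(String[] args) {
        Job job = new Job("job-1");
        boolean finished = job.tryFinishGlobally(Status.FAILED, new InMemoryResultStore(true));
        System.out.println("finished=" + finished + ", visible=" + job.getVisibleStatus());
    }
}
```

With this ordering, a monitor polling the visible status never observes {{FAILED}} for a job that recovery would still pick up. How a failed write should be retried or escalated is left open here.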
--
This message was sent by Atlassian Jira
(v8.20.10#820010)