Piotr Nowojski created FLINK-38180:
--------------------------------------

             Summary: Race condition between failing Task and cancelation hiding the real exception
                 Key: FLINK-38180
                 URL: https://issues.apache.org/jira/browse/FLINK-38180
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.20.2, 1.19.3, 1.18.1, 1.17.2, 1.16.3, 1.15.4, 2.0.0, 1.14.0
            Reporter: Piotr Nowojski
            Assignee: Piotr Nowojski

When a task fails, the {{Task#restoreAndInvoke}} invocation calls {{finalInvokable.cleanUp(throwable)}}, which can ultimately cancel some state backend operations. This cancellation can then call {{Task#failExternally(CancelTaskException)}} before the task thread manages to switch to the {{FAILED}} state and properly set the {{failureCause}}. When that happens, the task switches from {{RUNNING}} to {{FAILED}} because of that asynchronous cancellation:

{noformat}
switched from RUNNING to FAILED due to CancelTaskException.
{noformat}

Later, when the task thread processes the real failure, the task is already in the {{FAILED}} state, so the real failure is ignored.
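
The interleaving can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not Flink's actual {{Task}} code: the class {{TaskStateSketch}}, its {{fail}} method, and the stand-in exceptions are assumptions made for illustration. It assumes only that the first caller to move the state from {{RUNNING}} to {{FAILED}} gets to record its cause.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified, hypothetical model of the state transition; not Flink's Task class. */
class TaskStateSketch {
    enum State { RUNNING, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private volatile Throwable failureCause;

    /** Only the first caller that moves RUNNING -> FAILED records its cause. */
    boolean fail(Throwable cause) {
        if (state.compareAndSet(State.RUNNING, State.FAILED)) {
            failureCause = cause;
            return true;
        }
        // Already FAILED: this cause is silently dropped.
        return false;
    }

    State getState() {
        return state.get();
    }

    Throwable getFailureCause() {
        return failureCause;
    }

    public static void main(String[] args) {
        TaskStateSketch task = new TaskStateSketch();
        Throwable realFailure = new RuntimeException("real failure");
        // Stand-in for CancelTaskException (assumption, to keep the sketch self-contained).
        Throwable cancellation = new IllegalStateException("cancelled during cleanup");

        // Interleaving from the report: the async cancellation reports first...
        boolean cancellationRecorded = task.fail(cancellation);
        // ...and the task thread reports the real failure afterwards.
        boolean realFailureRecorded = task.fail(realFailure);

        System.out.println("cancellation recorded: " + cancellationRecorded);
        System.out.println("real failure recorded: " + realFailureRecorded);
        System.out.println("failureCause: " + task.getFailureCause().getMessage());
    }
}
```

In this model the second call to {{fail}} returns {{false}} and the recorded cause stays the cancellation, which mirrors how the real exception ends up hidden.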