Piotr Nowojski created FLINK-38180:
--------------------------------------
Summary: Race condition between failing Task and cancelation
hiding the real exception
Key: FLINK-38180
URL: https://issues.apache.org/jira/browse/FLINK-38180
Project: Flink
Issue Type: Bug
Components: Runtime / Task
Affects Versions: 1.20.2, 1.19.3, 1.18.1, 1.17.2, 1.16.3, 1.15.4, 2.0.0,
1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
When a task fails during the {{Task#restoreAndInvoke}} invocation, we call
{{finalInvokable.cleanUp(throwable)}}, which can ultimately cancel some state
backend operations. This cancellation can then report
{{Task#failExternally(CancelTaskException)}} before the task thread manages to
switch to the {{FAILED}} state and properly set the {{failureCause}}. When that
happens, the task switches from {{RUNNING}} to {{FAILED}}:
{noformat}
switched from RUNNING to FAILED due to CancelTaskException.
{noformat}
with the cause coming from that asynchronous cancellation rather than from the
original failure.
Later, when the task thread processes the real failure, the Task is already in
the {{FAILED}} state, so the real failure is ignored.
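
To make the interleaving concrete, here is a minimal, self-contained sketch of the race. It is not Flink's actual {{Task}} code: {{SketchTask}}, {{transitionToFailed}}, {{replayRace}} and the local {{CancelTaskException}} are hypothetical stand-ins, and the sketch assumes that the first caller to move the state from {{RUNNING}} to {{FAILED}} is the one whose exception gets recorded. The two reports are replayed sequentially in the problematic order so that the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified model of the race; not Flink's actual Task class. */
public class TaskFailureRaceSketch {

    public enum State { RUNNING, FAILED }

    /** Local stand-in for Flink's CancelTaskException (assumption for this sketch). */
    public static class CancelTaskException extends RuntimeException {}

    public static class SketchTask {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private volatile Throwable failureCause;

        /** First caller to move RUNNING -> FAILED wins; later causes are dropped. */
        public boolean transitionToFailed(Throwable cause) {
            if (state.compareAndSet(State.RUNNING, State.FAILED)) {
                failureCause = cause;
                return true;
            }
            return false; // already FAILED: this cause is ignored
        }

        public Throwable getFailureCause() {
            return failureCause;
        }
    }

    /** Replays the problematic interleaving in a fixed order. */
    public static SketchTask replayRace(Throwable realFailure) {
        SketchTask task = new SketchTask();
        // 1. The task thread hits realFailure and starts cleanup,
        //    which cancels asynchronous state backend work.
        // 2. The canceled asynchronous work reports its own exception first.
        task.transitionToFailed(new CancelTaskException());
        // 3. Only now does the task thread report the real failure; it is dropped.
        task.transitionToFailed(realFailure);
        return task;
    }

    public static void main(String[] args) {
        SketchTask task = replayRace(new IllegalStateException("real failure"));
        // Prints "CancelTaskException": the real failure was hidden.
        System.out.println(task.getFailureCause().getClass().getSimpleName());
    }
}
```

In this model the recorded cause is the cancellation exception, which matches the log line above. How the real fix should order or prioritize the two reports is outside the scope of this sketch.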