Piotr Nowojski created FLINK-38180:
--------------------------------------

             Summary: Race condition between failing Task and cancelation 
hiding the real exception
                 Key: FLINK-38180
                 URL: https://issues.apache.org/jira/browse/FLINK-38180
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.20.2, 1.19.3, 1.18.1, 1.17.2, 1.16.3, 1.15.4, 2.0.0, 
1.14.0
            Reporter: Piotr Nowojski
            Assignee: Piotr Nowojski


When a task fails, {{Task#restoreAndInvoke}} calls 
{{finalInvokable.cleanUp(throwable)}}, which can ultimately cancel some state 
backend operations. This cancellation can then report 
{{Task#failExternally(CancelTaskException)}} before the task thread manages to 
switch to the {{FAILED}} state and properly set the {{failureCause}}. When that 
happens, the task switches from {{RUNNING}} to {{FAILED}}:

{noformat}
switched from RUNNING to FAILED due to CancelTaskException.
{noformat}

as a result of that asynchronous cancellation.

Later, when the task thread processes the real failure, the Task is already in 
the {{FAILED}} state, so the real failure is ignored.
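
The ordering described above can be illustrated with a minimal, self-contained sketch. This is not Flink's actual {{Task}} implementation: {{MiniTask}}, its {{fail}} method, and the local {{CancelTaskException}} class are hypothetical stand-ins, and the sketch assumes that only the first caller to move the state out of {{RUNNING}} gets to record its cause. The two reports are issued sequentially to make the losing order deterministic.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TaskRaceSketch {

    enum State { RUNNING, FAILED }

    // Hypothetical stand-in for the exception reported by the cancelled async work.
    static final class CancelTaskException extends RuntimeException {
        CancelTaskException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for a task; not Flink's Task class.
    static final class MiniTask {
        private final AtomicReference<State> state =
                new AtomicReference<>(State.RUNNING);
        private volatile Throwable failureCause;

        // Assumption: only the first caller that moves RUNNING -> FAILED
        // records its cause; every later cause is silently dropped.
        boolean fail(Throwable cause) {
            if (state.compareAndSet(State.RUNNING, State.FAILED)) {
                failureCause = cause;
                return true;
            }
            return false;
        }

        Throwable failureCause() {
            return failureCause;
        }

        State state() {
            return state.get();
        }
    }

    public static void main(String[] args) {
        MiniTask task = new MiniTask();
        Throwable realFailure = new IllegalStateException("real failure");

        // Step 1: cleanup triggered by the real failure cancels async work,
        // and that cancellation reports its exception first.
        boolean cancelRecorded =
                task.fail(new CancelTaskException("cancelled during cleanup"));

        // Step 2: the task thread reports the real failure, but the task is
        // already FAILED, so this cause is ignored.
        boolean realRecorded = task.fail(realFailure);

        System.out.println("state: " + task.state());
        System.out.println("cancel recorded: " + cancelRecorded);
        System.out.println("real failure recorded: " + realRecorded);
        System.out.println("failureCause: "
                + task.failureCause().getClass().getSimpleName());
    }
}
```

In this sketch the recorded {{failureCause}} is the cancellation exception and the real failure is lost, which mirrors the symptom reported here.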



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
