[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279743#comment-17279743
 ] 

Piotr Nowojski commented on FLINK-17726:
----------------------------------------

We have discovered this issue once we introduced a bug, that caused Task switch 
to `CANCELLED` state incorrectly (it should switched to `FINISHED`) and 
deadlocked the job.

This led me to have some discussion with [~trohrmann] about a good way how to 
handle this kind of issues. For me the most important part would be to not have 
a regression in the way, how we are reporting the root/primary/real cause of 
the failure. Currently switching from `RUNNING` -> `CANCELLED` state is a valid 
thing to do for the task, if this is a "secondary" failure caused by 
upstream/downstream task issue. This currently allows JobManager to easily 
ignore those "secondary" failures, from the real failures and pick first 
reported "real" failure as the root cause of the job/region failure.

If we followed the proposed here in the task solution, to not allow the 
`RUNNING` -> `CANCELLED` transition, but just simply treat it as a regular 
"primary" failure, I would expect user to be flooded with hundreds of secondary 
failures, which would be extremely difficult for him to figure out what has 
happened. Primary example: "real" failure is a loss of Task Manager that was 
either not detected, or would be detected after heartbeat timeout, which caused 
hundreds/thousands "secondary" failures (currently `RUNNING` -> `CANCELLED` 
transitions). 

My proposal how to deal with this situation, would be to keep the distinction 
of the "secondary" failure, but also enrich it with the information which task 
was the reason behind. JobManager would receive information "Task B1 failed 
because something has happened to with the Task A1". 

That would let us do two things:
* If JobManager managed to detect some primary failure, it could ignore (or 
batch together) all of the secondary failures
* If no primary failure was detected, and we want to failover the job without 
waiting for example for the heartbeat pointing to the primary failure, Job 
Manager could connect secondary failures and the DAG, to deduce that something 
bad has happened with the "Task B1"

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>
> JobManager will not trigger failure handling when receiving CANCELED task 
> update. 
> This is because CANCELED tasks are usually caused by another FAILED task. 
> These CANCELED tasks will be restarted by the failover process triggered  
> FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime 
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to 
> CANCELED from all states except from CANCELING as failed tasks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to