[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280409#comment-17280409
 ] 

Zhu Zhu commented on FLINK-17726:
---------------------------------

Thanks for relaunching this discussion and proposing a solution.
I'd like to double-check my understanding of the proposal. Please correct me
if I have anything wrong:
1. A task should not transition to CANCELED in the TM unless it was CANCELING.
Instead, it should transition to FAILED with a "secondary" failure that carries
information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary
failures. For "secondary" failures, given that the related "primary" failure
should always be reported sooner or later, JM can simply mark the task as
CANCELED and skip the failure handling. As a further improvement, JM can
register a timeout on secondary failures, either to cover the case where the
related "primary" failure is never reported, or to speed up recovery without
waiting for a heartbeat timeout.
3. JM triggers a failover if a task transitions directly from DEPLOYING/RUNNING
to CANCELED in the TM, although this should never happen once #1 is done.
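
To make the three points concrete, here is a minimal sketch of the JM-side
decision as I understand it. It is only an illustration: the class, enum and
method names are hypothetical and are not existing Flink classes, and the
"secondary" flag is assumed to be something the TM reports along with the
FAILED state.

```java
/** Hypothetical sketch of the proposal; none of these names exist in Flink. */
public class SecondaryFailureSketch {

    public enum TaskState { DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED }

    public enum Action { TRIGGER_FAILOVER, MARK_CANCELED_AND_WAIT, NO_ACTION }

    /** JM-side decision for a task state update reported by a TM. */
    public static Action onTaskUpdate(
            TaskState previous, TaskState reported, boolean secondaryFailure) {
        if (reported == TaskState.FAILED) {
            // Point 2: a primary failure triggers a failover. A secondary one is
            // marked CANCELED and left to the failover of the primary failure.
            return secondaryFailure
                    ? Action.MARK_CANCELED_AND_WAIT
                    : Action.TRIGGER_FAILOVER;
        }
        if (reported == TaskState.CANCELED) {
            // Point 3: CANCELED is only expected after CANCELING. Any other
            // predecessor state is handled like a failure.
            return previous == TaskState.CANCELING
                    ? Action.NO_ACTION
                    : Action.TRIGGER_FAILOVER;
        }
        return Action.NO_ACTION;
    }

    /** Optional timeout from point 2 (assumed to be registered per secondary failure). */
    public static Action onSecondaryFailureTimeout(boolean primaryFailureReported) {
        // If the primary failure never arrived, stop waiting and recover anyway.
        return primaryFailureReported ? Action.NO_ACTION : Action.TRIGGER_FAILOVER;
    }
}
```

With this shape, a RUNNING task reported as CANCELED would go through the
normal failover path, while a task reported as FAILED with a secondary failure
would only be recovered by the failover of its primary failure or by the
timeout.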

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>
> JobManager will not trigger failure handling when it receives a CANCELED
> task update.
> This is because a CANCELED task is usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime 
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from any state other than CANCELING as failed tasks.



