[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280409#comment-17280409 ]
Zhu Zhu commented on FLINK-17726:
---------------------------------

Thanks for relaunching this discussion and proposing a solution.

I'd like to double-check the proposal. Please correct me if I have misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. Otherwise it should transition to FAILED with a "secondary" failure that carries information about the root-cause task.
2. The JM triggers failovers on "primary" failures and ignores the related "secondary" failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, the JM can simply mark the task as CANCELED and skip the failure handling. As a further improvement, the JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen once #1 is done.

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>
> JobManager will not trigger failure handling when receiving CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered
> by the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from all states except from CANCELING as failed tasks.
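
To make the three points of the comment concrete, here is a rough sketch of the decision the JM would take for each task state update. It is only an illustration under stated assumptions: `TaskUpdateSketch`, `Update`, `Action` and `decide` are hypothetical names, not Flink classes, and the optional timeout on secondary failures from point 2 is left out.

```java
// Hypothetical sketch of the proposed JM-side handling; none of these
// names exist in Flink, they only illustrate the rules discussed above.
class TaskUpdateSketch {

    enum State { DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED }

    enum Action { TRIGGER_FAILOVER, MARK_CANCELED_SKIP_HANDLING, NO_FAILURE_HANDLING }

    /** A task state update as the TM might report it (assumed shape). */
    static final class Update {
        final State previous;
        final State current;
        // Name of the root-cause task for a "secondary" failure, null for a "primary" one.
        final String rootCauseTask;

        Update(State previous, State current, String rootCauseTask) {
            this.previous = previous;
            this.current = current;
            this.rootCauseTask = rootCauseTask;
        }
    }

    static Action decide(Update update) {
        if (update.current == State.FAILED) {
            // Point 2: fail over on primary failures only; a secondary failure
            // is marked CANCELED and waits for its primary to be reported.
            return update.rootCauseTask == null
                    ? Action.TRIGGER_FAILOVER
                    : Action.MARK_CANCELED_SKIP_HANDLING;
        }
        if (update.current == State.CANCELED) {
            // Point 3: CANCELED is only expected after CANCELING; any other
            // predecessor state is treated like a failure.
            return update.previous == State.CANCELING
                    ? Action.NO_FAILURE_HANDLING
                    : Action.TRIGGER_FAILOVER;
        }
        return Action.NO_FAILURE_HANDLING;
    }

    public static void main(String[] args) {
        System.out.println(decide(new Update(State.RUNNING, State.FAILED, null)));
        System.out.println(decide(new Update(State.RUNNING, State.FAILED, "upstream-task")));
        System.out.println(decide(new Update(State.CANCELING, State.CANCELED, null)));
        System.out.println(decide(new Update(State.RUNNING, State.CANCELED, null)));
    }
}
```

In this sketch a missing root-cause task is what marks a failure as "primary"; how the TM would actually encode that distinction is an open design question in the discussion.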