Zhu Zhu created FLINK-14375: ------------------------------- Summary: Avoid to trigger failover on a non-effective task failure notification Key: FLINK-14375 URL: https://issues.apache.org/jira/browse/FLINK-14375 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Zhu Zhu Fix For: 1.10.0
The DefaultScheduler triggers failover if a task is notified to be FAILED. However, in the case the multiple tasks in the same region fail together, it will trigger multiple failovers. The later triggered failovers are useless, lead to concurrent failovers and will increase the restart attempts count. To avoid that, I'd propose to check that the effectiveness of a task failure to decide whether to trigger a failover, namely checking in {{DefaultScheduler#maybeHandleTaskFailure()}} to see whether the vertex state is really FAILED rather than CANCELED. -- This message was sent by Atlassian Jira (v8.3.4#803005)