[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065786#comment-16065786 ]
Siddharth Seth commented on TEZ-3718:
-------------------------------------
I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted differently from each other w.r.t. the config which determines whether tasks need to be restarted or not. I think both can be treated the same. [~jlowe] - you may have more context on this.

The changes to the Node-related classes mostly look good to me. Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, BLACKLISTED).

For AMContainer, I'm not sure why the fail-task config needs to be read. Will the following work?
- When the event is received, annotate the container to say "On A Failed Node" (already done).
- Inform prior and current attempts of the node failure.
- Don't change the container state - allow a task action to change the state via a STOP_REQUEST depending on task-level configs. OR, if there are no running fragments on the container, change the state to a COMPLETED state, so that new task allocations are not accepted. Do not accept new tasks since nodeFailure has been set.

TaskAttempt - from a brief glance, the functionality looks good. Fail fast / decide whether to keep a task / cause it to be killed on a node failure.
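A minimal sketch of the enum suggestion above - carrying a failure type in the node event instead of a boolean isUnhealthy flag, with both types consulting the same reschedule config. The class and method names here are illustrative only, not the actual Tez classes:

```java
// Hypothetical sketch, not the real Tez event classes: a single
// node-failure enum replaces the boolean isUnhealthy flag.
public class NodeFailureSketch {

    enum NodeFailureType {
        UNHEALTHY,    // node reported unhealthy (e.g. by the RM)
        BLACKLISTED   // node blacklisted by the AM after repeated failures
    }

    // Both failure types are treated identically w.r.t. the config that
    // decides whether completed tasks on the node are rescheduled.
    static boolean shouldReschedule(NodeFailureType type, boolean rescheduleConf) {
        switch (type) {
            case UNHEALTHY:
            case BLACKLISTED:
                return rescheduleConf;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldReschedule(NodeFailureType.UNHEALTHY, true));
        System.out.println(shouldReschedule(NodeFailureType.BLACKLISTED, false));
    }
}
```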
General
- Don't read from a Configuration instance within each AMContainer / TaskAttemptImpl - there's example code on how to avoid this in TaskImpl/TaskAttemptImpl.
- I think the configs would be the following:
  TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS=false - existing, default=false
  TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING=true - new, default=true (overrides TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS)
- The third config in the patch looks good.

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node. The alternative, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes:
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally, source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule it immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower.)
> [~jlowe] - I think you've looked at this in the past. Any thoughts/suggestions? What I'm seeing is retroactive failures taking a long time to apply and to restart sources which ran on a bad node. Also, running tasks are being counted as FAILURES instead of KILLS.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)