[
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067111#comment-16067111
]
Jason Lowe commented on TEZ-3718:
---------------------------------
bq. I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted
differently from each other w.r.t the config which determines whether tasks
need to be restarted or not.
I'm not sure I know the full history behind this, but unhealthy vs. blacklisted
can represent different contexts. For example, a node could go unhealthy because
too many disks have failed. In that case we probably want to proactively re-run
upstream tasks rather than wait for the fetch failures.

Node blacklisting is a little different from an unhealthy node. If a task runs
and fails before completing, we may want to blacklist that node to prevent other
tasks from also failing to complete on it. But if we have _completed_ tasks on
that blacklisted node, there's a decent chance we can complete the shuffle
despite the fact that tasks are failing. For example, say a task needs to use a
GPU, but something about the GPU setup on that node causes every task that tries
to use it to crash. If tasks that didn't need the GPU ran and succeeded on the
node, why should we proactively re-run them rather than just fetch their inputs?
That could be a huge waste of work and end up being a large performance hit to
the job. TEZ-3072 was filed because of behavior like this. It all comes down to
these two questions:
- if a node is unhealthy, is it likely I won't be able to successfully shuffle
data from it?
- if a node is blacklisted, is it likely I won't be able to successfully
shuffle data from it?
If the answer to both questions is always the same with respect to whether we
re-run completed tasks, then yes, we should treat them equivalently. I think there are
cases where we would want them to be different. Full disclosure -- we've been
running in a mode where we do _not_ re-run completed tasks on nodes if they go
unhealthy or are blacklisted. We found many cases where the node was still
able to shuffle most (sometimes all) of the completed data for tasks despite
being declared unhealthy or blacklisted. In short, re-running was causing more
problems than it was fixing for us, so now we simply wait for the fetch
failures. It's not always optimal, of course, and there are cases where
proactively re-running would have been preferable.
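To make the distinction concrete, the split I'm describing could be driven by two
independent knobs rather than one. A minimal sketch, assuming a hypothetical
helper class and hypothetical config keys (this is not the actual AMNodeImpl
code):
{code:java}
// Hypothetical sketch only -- not the actual AMNodeImpl logic. The config key
// names, the enum, and the class are assumptions for illustration.
import org.apache.hadoop.conf.Configuration;

public class NodeRerunPolicy {
  // Hypothetical property names; the real Tez configs may differ.
  static final String RERUN_ON_UNHEALTHY = "tez.am.node-unhealthy-reschedule-tasks";
  static final String RERUN_ON_BLACKLISTED = "tez.am.node-blacklisted-reschedule-tasks";

  enum NodeState { UNHEALTHY, BLACKLISTED }

  private final boolean rerunOnUnhealthy;
  private final boolean rerunOnBlacklisted;

  public NodeRerunPolicy(Configuration conf) {
    // Default to not re-running completed work, i.e. rely on fetch failures.
    this.rerunOnUnhealthy = conf.getBoolean(RERUN_ON_UNHEALTHY, false);
    this.rerunOnBlacklisted = conf.getBoolean(RERUN_ON_BLACKLISTED, false);
  }

  /** Should completed task attempts on a node in this state be proactively re-run? */
  public boolean shouldRerunCompleted(NodeState state) {
    switch (state) {
      case UNHEALTHY:
        return rerunOnUnhealthy;   // e.g. dead disks: shuffle from this node is likely broken
      case BLACKLISTED:
        return rerunOnBlacklisted; // e.g. GPU crashes: completed output may still be fetchable
      default:
        return false;
    }
  }
}
{code}
With both defaults off you get the wait-for-fetch-failures behavior we run with;
turning only the unhealthy knob on covers the dead-disk case without penalizing
blacklisted-but-still-fetchable nodes.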
> Better handling of 'bad' nodes
> ------------------------------
>
> Key: TEZ-3718
> URL: https://issues.apache.org/jira/browse/TEZ-3718
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Zhiyuan Yang
> Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to
> do nothing other than not schedule new tasks on this node.
> The alternative, via config, is to retroactively kill every task which ran on
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of
> relying on a timeout, which leads to the attempt being marked as FAILED after
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure
> heuristics. Normally source tasks require multiple consumers to report
> failure for them to be marked as bad. If a single consumer reports failure
> against a source which ran on a bad node, consider it bad and re-schedule
> immediately. (Otherwise failures can take a while to propagate, and jobs get
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is that retroactive failures take a long time to apply and to
> restart sources which ran on a bad node, and that running tasks are being
> counted as FAILURES instead of KILLS.
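To illustrate item 2 of the proposal quoted above, here is a rough sketch of the
intended heuristic; the class, method, and threshold value are illustrative
assumptions and do not reflect the attached patches:
{code:java}
import java.util.Set;

// Illustrative sketch of the heuristic in item 2; names and the threshold value
// are made up, not taken from TEZ-3718.1.patch / TEZ-3718.2.patch.
public class SourceRerunHeuristic {
  // Normally several distinct consumers must report failures before a completed
  // source attempt is re-run.
  private static final int NORMAL_FAILURE_THRESHOLD = 4; // illustrative value

  private final Set<String> badNodes; // nodes currently marked bad by the AM

  public SourceRerunHeuristic(Set<String> badNodes) {
    this.badNodes = badNodes;
  }

  /**
   * Decide whether a completed source attempt should be re-scheduled after a
   * consumer reports a fetch failure against it.
   */
  public boolean shouldRerunSource(String sourceNode, int distinctConsumersReportingFailure) {
    if (badNodes.contains(sourceNode)) {
      // Source ran on a known-bad node: a single report is enough evidence,
      // re-run immediately instead of waiting for more failures to propagate.
      return distinctConsumersReportingFailure >= 1;
    }
    // Otherwise keep the normal, more conservative threshold.
    return distinctConsumersReportingFailure >= NORMAL_FAILURE_THRESHOLD;
  }
}
{code}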
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)