[ 
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302864#comment-17302864
 ] 

Eric Badger commented on YARN-10616:
------------------------------------

bq. For the "updateNodeResource" issue, one question is that is it a frequently 
used operation? I'm not ware of the scenario that we use this often.
[~ztang], we use this feature internally. Maybe once or twice a day across all 
of our clusters. Usually to quickly remove a node from a cluster while we 
investigate why it's running slow or causing errors. We will use 
{{updateNodeResource}} to set the node resources to 0, meaning that nothing 
will get scheduled on the node. But the NM will still be running so that we can 
jstack or grab a heap dump. For us at least, the only time we ever use this 
operation is to remove a node from the cluster. So maybe there's a different 
way that we could do that such that it doesn't mess with the node resources. 
Because this really is just a simple hack to get the node to not schedule 
anything else.
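
To make that concrete, here is a minimal sketch of the kind of drain helper this 
amounts to (hypothetical script; it only shells out to the 
{{yarn rmadmin -updateNodeResource <NodeID> <MemSize> <vCores>}} CLI, and the 
node ID in the comment is made up):

{code:python}
#!/usr/bin/env python3
# Hypothetical helper: take a node out of scheduling by setting its memory and
# vcore capacities to 0 via "yarn rmadmin -updateNodeResource". The NM keeps
# running, so jstack/heap dumps are still possible while nothing new lands on it.
import subprocess
import sys

def drain_node(node_id: str) -> None:
    # NodeID is host:port, e.g. "gpuhost01.example.com:8041" (made-up hostname).
    subprocess.run(
        ["yarn", "rmadmin", "-updateNodeResource", node_id, "0", "0"],
        check=True)

if __name__ == "__main__":
    drain_node(sys.argv[1])
{code}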

> Nodemanagers cannot detect GPU failures
> ---------------------------------------
>
>                 Key: YARN-10616
>                 URL: https://issues.apache.org/jira/browse/YARN-10616
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU.
>
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node that 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in this case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that would give potentially unexpected behavior in relation to nodes 
> properly auto-detecting failures.
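>
> To illustrate that tradeoff, here is a purely hypothetical model of the RM-side 
> rule (made-up names, not actual YARN classes): once an admin override is in 
> place, heartbeat-reported capabilities would be ignored until re-registration, 
> which is exactly where the auto-detection gets lost.
> {code:python}
> # Purely hypothetical model (not YARN code): -updateNodeResource sets an
> # override flag that suppresses heartbeat updates until the node re-registers.
> from dataclasses import dataclass
>
> @dataclass
> class NodeRecord:
>     resource: dict                # e.g. {"memory-mb": 196608, "vcores": 48, "yarn.io/gpu": 8}
>     admin_override: bool = False  # set when -updateNodeResource is applied
>
> def on_admin_update(node: NodeRecord, new_resource: dict) -> None:
>     node.resource = new_resource
>     node.admin_override = True    # heartbeats no longer change this node
>
> def on_heartbeat(node: NodeRecord, reported: dict) -> None:
>     if not node.admin_override:
>         node.resource = reported  # RM keeps tracking what the NM reports
>
> def on_reregistration(node: NodeRecord, reported: dict) -> None:
>     node.admin_override = False   # NM or RM restart clears the override
>     node.resource = reported
>
> # A heartbeat after an admin zeroed the node neither restores capacity nor
> # picks up the GPU the NM lost in the meantime:
> node = NodeRecord({"memory-mb": 196608, "vcores": 48, "yarn.io/gpu": 8})
> on_admin_update(node, {"memory-mb": 0, "vcores": 0, "yarn.io/gpu": 0})
> on_heartbeat(node, {"memory-mb": 196608, "vcores": 48, "yarn.io/gpu": 7})
> assert node.resource["yarn.io/gpu"] == 0
> {code}
>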
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
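>
> As a sketch of that second idea (hypothetical script, not part of YARN): the 
> NM's script-based health checker (configured via 
> {{yarn.nodemanager.health-checker.script.path}}) marks a node unhealthy when 
> the script prints a line beginning with ERROR, so a small GPU probe could hook 
> in there.
> {code:python}
> #!/usr/bin/env python3
> # Hypothetical NM health-check script: counts the GPUs nvidia-smi can still
> # see and prints an ERROR line when fewer than expected respond; the NM's
> # script-based health checker treats that as "mark this node unhealthy".
> import subprocess
> import sys
>
> EXPECTED_GPUS = 8  # assumption: an 8-GPU node, as in the example above
>
> def visible_gpu_count() -> int:
>     try:
>         out = subprocess.run(
>             ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
>             capture_output=True, text=True, timeout=60)
>     except (OSError, subprocess.TimeoutExpired):
>         return 0  # nvidia-smi missing or hung counts as no healthy GPUs
>     if out.returncode != 0:
>         return 0
>     return len([line for line in out.stdout.splitlines() if line.strip()])
>
> if __name__ == "__main__":
>     count = visible_gpu_count()
>     if count < EXPECTED_GPUS:
>         print(f"ERROR: only {count} of {EXPECTED_GPUS} GPUs visible to nvidia-smi")
>     sys.exit(0)
> {code}
>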
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. Would like to hear the thoughts of others on how best to approach this.
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]


