[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304456#comment-17304456 ]

Eric Badger edited comment on YARN-10616 at 3/18/21, 9:22 PM:
--------------------------------------------------------------

The issue with graceful decommissioning is that you have to edit a file on the 
RM. It would be nice to be able to run a {{yarn rmadmin}} command from a remote 
host to tell the RM to gracefully decommission a node. AFAIK that functionality 
doesn't exist. 
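
For reference, here's roughly what the current workflow looks like (a sketch; 
the exclude file path, hostname, and timeout are just examples, and this assumes 
{{yarn.resourcemanager.nodes.exclude-path}} points at that file):

{code:bash}
# On the RM host: add the node to the exclude file referenced by
# yarn.resourcemanager.nodes.exclude-path in yarn-site.xml
echo "nm-host.example.com" >> /etc/hadoop/conf/yarn.exclude

# Then tell the RM to re-read the file and drain the node gracefully
# (-g takes an optional timeout in seconds)
yarn rmadmin -refreshNodes -g 3600 -client
{code}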

I still don't like the idea of completely undermining {{-updateNodeResource}}. 
I think I would be more on board with a feature that is disabled by default, 
but can be enabled. That way we won't break any existing ways of doing things, 
but will give more flexibility to those who want to detect these types of 
failures. They will just have to understand that it isn't compatible with 
{{-updateNodeResource}}.
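
To make the conflict concrete, this is roughly how the manual override is used 
today (the NodeID, memory, and vcore values below are placeholders); with 
heartbeat-driven updates enabled, this override would be clobbered on the next 
NM heartbeat:

{code:bash}
# Manually pin a node to 64 GB / 32 vcores (placeholder NodeID and sizes)
yarn rmadmin -updateNodeResource nm-host.example.com:45454 65536 32
{code}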


> Nodemanagers cannot detect GPU failures
> ---------------------------------------
>
>                 Key: YARN-10616
>                 URL: https://issues.apache.org/jira/browse/YARN-10616
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node that 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in this case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that could lead to unexpected behavior, since those nodes would stop 
> auto-detecting failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
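> A rough sketch of that health-script hook (hypothetical script; assumes 8 GPUs 
> per node and the usual NM health-checker convention that an output line 
> starting with ERROR marks the node unhealthy):
> {code:bash}
> #!/usr/bin/env bash
> # Hypothetical NM health-check script: compare the GPUs nvidia-smi can
> # currently see against the expected count for this node.
> EXPECTED_GPUS=8
> visible=$(nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null | wc -l)
> if [ "$visible" -lt "$EXPECTED_GPUS" ]; then
>   # An ERROR line tells the NM health checker to mark the whole node
>   # UNHEALTHY, which is exactly the "take out the entire node" downside.
>   echo "ERROR: only $visible of $EXPECTED_GPUS GPUs are visible to nvidia-smi"
> fi
> {code}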
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. Would like to hear thoughts from others on how best to approach this.
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]


