[ 
https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951399#comment-16951399
 ] 

Wangda Tan commented on YARN-9656:
----------------------------------

[~pgolash], [~mayank_bansal], to me, if a node cannot accept new tasks because 
its disk is nearly full or because it is under stress, it is in the same 
"unhealthy" state either way.  

Is there any diagnostic we can use to report why the node is unhealthy? If we 
add an "unhealthy reason/type" field to the node info, is that good enough to 
solve the problem? Writing this to a file that the RM loads seems like just a 
way to bypass the RPC between RM and NM, but it leaves a lot of work to the 
plugin: collecting NM metrics, writing them to a file, and placing that file 
on a filesystem that the RM can access. 

If we choose to leave the plugin in the NM, anybody can implement new logic to 
categorize issues on the NM, and admins can query the result from the web UI, 
etc. 
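
As a rough illustration of the "unhealthy reason/type" idea, here is a minimal 
sketch of the kind of categorization logic a plugin on the NM could run. 
Everything in it is an assumption made for illustration: the names 
UnhealthyReason, NodeMetrics and classify, the two categories, and the 
threshold values are hypothetical and are not part of YARN or of the attached 
patch. 

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class UnhealthyReason(Enum):
    # Hypothetical categories; the thread does not define a fixed list.
    DISK_NEAR_FULL = "disk near full"
    CPU_STRESSED = "CPU stressed"


@dataclass
class NodeMetrics:
    # Hypothetical metrics, both expressed as a fraction from 0.0 to 1.0.
    disk_used_fraction: float
    cpu_load_fraction: float


def classify(metrics: NodeMetrics,
             disk_limit: float = 0.9,
             cpu_limit: float = 0.95) -> List[UnhealthyReason]:
    """Return every reason the node counts as unhealthy (empty if healthy)."""
    reasons = []
    if metrics.disk_used_fraction >= disk_limit:
        reasons.append(UnhealthyReason.DISK_NEAR_FULL)
    if metrics.cpu_load_fraction >= cpu_limit:
        reasons.append(UnhealthyReason.CPU_STRESSED)
    return reasons


if __name__ == "__main__":
    # A node with a nearly full disk but moderate CPU load.
    for reason in classify(NodeMetrics(0.97, 0.40)):
        print(reason.value)
```

Under this sketch the node is still simply "unhealthy" from the scheduler's 
point of view; the list of reasons is extra diagnostic detail that could be 
attached to the node info. 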

Thoughts?

> Plugin to avoid scheduling jobs on node which are not in "schedulable" state, 
> but are healthy otherwise.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9656
>                 URL: https://issues.apache.org/jira/browse/YARN-9656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, resourcemanager
>    Affects Versions: 2.9.1, 3.1.2
>            Reporter: Prashant Golash
>            Assignee: Prashant Golash
>            Priority: Major
>         Attachments: 2.patch
>
>
> Creating this Jira to ask the community whether this is something helpful 
> that could be done in YARN. Sometimes nodes go into a bad state, e.g. because 
> of a H/W problem (bad I/O, a fan problem). In other scenarios, if CGroups are 
> not enabled, nodes may be running very high on CPU, and the jobs scheduled on 
> them will suffer.
>  
> The idea is three-fold:
>  # Gather relevant metrics from node-managers and put them in some form 
> (e.g. an exclude file).
>  # RM loads the file and puts the nodes on the blacklist.
>  # Once a node becomes good again, it can be put back on the whitelist.
> Various optimizations can be done here, but I would like to understand if 
> this is something which could be helpful as an upstream feature in YARN.
>  
>  
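
The three steps in the issue description above could be sketched roughly as 
follows. This is only an illustration under stated assumptions: the function 
names, the host names, and the per-node boolean "schedulable" flag are 
hypothetical, and the one-hostname-per-line layout is an assumed format for 
the exclude file, not something the issue specifies. 

```python
from typing import Dict, List


def build_exclude_list(schedulable: Dict[str, bool]) -> List[str]:
    """Steps 1 and 2: collect the hosts that should be blacklisted.

    `schedulable` maps a hostname to a hypothetical flag derived from
    NM metrics; False means the node should not receive new work.
    """
    return sorted(host for host, ok in schedulable.items() if not ok)


def render_exclude_file(hosts: List[str]) -> str:
    """Render the list in the assumed format: one hostname per line."""
    return "".join(f"{host}\n" for host in hosts)


if __name__ == "__main__":
    state = {"nm1": True, "nm2": False, "nm3": False}
    print(render_exclude_file(build_exclude_list(state)), end="")

    # Step 3: once nm2 recovers, regenerating the list drops it again.
    state["nm2"] = True
    print(render_exclude_file(build_exclude_list(state)), end="")
```

Because the list is rebuilt from the current state each time, a recovered 
node leaves the blacklist without any separate whitelist bookkeeping. 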


