[ https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949183#comment-16949183 ]
Prashant Golash commented on YARN-9656:
---------------------------------------

Thanks, [~wangda], for taking a look.

Initially we thought of just marking the node "unhealthy" and extended our NMs to include these checks in the NM health check scripts, but we realized that this could result in a lot of unhealthy nodes (for example, in our cluster). So we thought of adding an intermediate state, "stressed", that is also controlled by a threshold at the RM layer.

I guess this may be specific to our environment, and for upstream just configuring the NM scripts should be enough.

> Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9656
>                 URL: https://issues.apache.org/jira/browse/YARN-9656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, resourcemanager
>    Affects Versions: 2.9.1, 3.1.2
>            Reporter: Prashant Golash
>            Assignee: Prashant Golash
>            Priority: Major
>         Attachments: 2.patch
>
>
> Creating this Jira to get feedback from the community on whether this is
> something helpful that can be done in YARN. Sometimes nodes go into a bad
> state (for example, a H/W problem: I/O is bad; fan problem). In other
> scenarios, if CGroups are not enabled, nodes may be running very high on CPU
> and the jobs scheduled on them will suffer.
>
> The idea is three-fold:
> # Gather relevant metrics from the node managers and put them in some form
> (for example, an exclude file).
> # The RM loads the file and puts the nodes on the blacklist.
> # Once a node becomes good, it can be put back on the whitelist.
> Various optimizations can be done here, but I would like to understand if
> this is something which could be helpful as an upstream feature in YARN.
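
For illustration only, the three steps in the description and the intermediate "stressed" state from the comment could be sketched roughly as below. This is not code from the attached patch and it does not use any YARN API: the `NodeMetrics` type, the threshold constants, and the `max_excluded_fraction` cap are all hypothetical names and values, and the cap is only one possible reading of "controlled by a threshold at the RM layer".

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical per-node sample reported by a node manager."""
    host: str
    cpu_util: float  # fraction of CPU in use, 0.0 to 1.0
    io_wait: float   # fraction of time spent waiting on I/O, 0.0 to 1.0


# Assumed thresholds; real values would depend on the cluster.
STRESSED_CPU = 0.85
UNHEALTHY_CPU = 0.98
STRESSED_IO_WAIT = 0.30


def classify(m: NodeMetrics) -> str:
    """Map one sample to 'schedulable', 'stressed', or 'unhealthy'."""
    if m.cpu_util >= UNHEALTHY_CPU:
        return "unhealthy"
    if m.cpu_util >= STRESSED_CPU or m.io_wait >= STRESSED_IO_WAIT:
        return "stressed"
    return "schedulable"


def build_exclude_list(metrics, max_excluded_fraction=0.2):
    """Return hosts to exclude from scheduling, busiest CPU first.

    At most max_excluded_fraction of all reported nodes are excluded,
    so a cluster-wide spike cannot take every node out of scheduling.
    """
    candidates = [m for m in metrics if classify(m) != "schedulable"]
    candidates.sort(key=lambda m: m.cpu_util, reverse=True)
    limit = int(len(metrics) * max_excluded_fraction)
    return [m.host for m in candidates[:limit]]


nodes = [
    NodeMetrics("n1", 0.50, 0.05),
    NodeMetrics("n2", 0.90, 0.05),
    NodeMetrics("n3", 0.99, 0.10),
    NodeMetrics("n4", 0.40, 0.45),
]
exclude = build_exclude_list(nodes, max_excluded_fraction=0.5)
```

Because the list is rebuilt from fresh metrics on every pass, a node that recovers simply drops off it on the next pass, which corresponds to step 3 of the description.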