[ 
https://issues.apache.org/jira/browse/YARN-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kartik Bhatia resolved YARN-8345.
---------------------------------
    Resolution: Duplicate

> NodeHealthCheckerService to differentiate between reason for UnusableNodes 
> for client to act suitably on it
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8345
>                 URL: https://issues.apache.org/jira/browse/YARN-8345
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>            Reporter: Kartik Bhatia
>            Priority: Major
>
> +*Current Scenario :*+ 
> NodeHealthCheckerService marks a node unhealthy on the basis of two checks: 
>  # External health-check script
>  # Directory status
> If a directory is marked as full (per the disk-check configs in yarn-site), the 
> NodeManager marks the node unhealthy. 
> Once a node is marked unhealthy, MapReduce relaunches all the map tasks that 
> ran on that node, so even successful tasks are rerun.
> +*Problem :*+
> There is no distinction between the disk limit beyond which containers should 
> no longer be launched on a node and the limit beyond which reducers can no 
> longer read data from it.
> For example, consider a 3 TB disk. If we set the max disk utilisation 
> percentage to 95% (since launching a container requires approx 0.15 TB for 
> jobs in our cluster) and there are a few nodes where disk utilisation is, say, 
> 96%, the threshold is breached. These nodes are marked unhealthy by the 
> NodeManager, and all successful mappers are relaunched on other nodes. Yet the 
> remaining 4% of disk space is still enough for reducers to read that data, so 
> the relaunches cause unnecessary delay in our jobs. (Relaunched mappers can 
> also preempt reducers when space is tight, and there are issues with 
> calculating headroom in the Capacity Scheduler as well.)
>  
> +*Correction :*+
> We need a state (say UNUSABLE_WRITE) that lets MapReduce know the node is 
> still good for reading data, so that successful mappers are not relaunched. 
> This would prevent the delay.
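> The proposed distinction can be sketched as a classifier over two separate 
> utilisation limits. This is an illustrative sketch, not existing YARN code: 
> UNUSABLE_WRITE is the state name suggested above, while the read-limit 
> threshold and the function name are hypothetical.

```python
from enum import Enum

class NodeState(Enum):
    HEALTHY = "healthy"                # below both thresholds
    UNUSABLE_WRITE = "unusable_write"  # too full to launch containers, still readable
    UNHEALTHY = "unhealthy"            # too full even to serve reads

def classify(disk_used_pct, write_limit_pct=95.0, read_limit_pct=99.0):
    """Classify a node's disk utilisation against two separate limits.

    write_limit_pct and read_limit_pct are hypothetical knobs for this
    sketch; only a single threshold exists in YARN today.
    """
    if disk_used_pct <= write_limit_pct:
        return NodeState.HEALTHY
    if disk_used_pct <= read_limit_pct:
        # Over the container-launch limit, but reducers can still fetch
        # existing map output: successful mappers need not be relaunched.
        return NodeState.UNUSABLE_WRITE
    return NodeState.UNHEALTHY

# The 96%-utilised node from the example is only write-unusable:
print(classify(96.0))  # NodeState.UNUSABLE_WRITE
```

> With such a state, MapReduce could skip relaunching completed mappers on a 
> node that is merely write-unusable.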
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org