[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062972#comment-14062972
 ] 

Craig Welch commented on YARN-1680:
-----------------------------------

I was also wondering whether we could maintain a Resource representing the 
total resources blacklisted by the application, updating it whenever 
nodes/racks are added to or removed from the application blacklist, instead 
of iterating over the nodes to sum the blacklisted resources each time 
headroom is calculated.  This "blacklisted" resource would be subtracted from 
the cluster resource (similar to how the current patch works in that respect) 
so that the headroom calculation is correct.  This seems like a good approach 
because updating that resource when adding entries to and removing them from 
the blacklist should be "close to free", and I think blacklisting is likely 
to be less frequent than headroom calculation.  Thoughts?
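
A rough sketch of that bookkeeping, to make the idea concrete.  It is not 
taken from the patch and does not use the real scheduler classes: the class 
BlacklistHeadroomSketch, its methods, and the use of a plain long (memory in 
MB) in place of a Resource are all hypothetical.  It handles nodes only, not 
racks, and it subtracts the full capacity of each blacklisted node, so it 
ignores containers the application may already be running there.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; these are not YARN scheduler APIs. */
class BlacklistHeadroomSketch {
    // Capacity in MB per known node (stand-in for the scheduler's node map).
    private final Map<String, Long> nodeCapacityMb = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();
    private long clusterMb = 0;      // running total of cluster capacity
    private long blacklistedMb = 0;  // running total of blacklisted capacity

    /** Registers a node, or updates its capacity if it is already known. */
    void addNode(String host, long capacityMb) {
        Long old = nodeCapacityMb.put(host, capacityMb);
        long delta = capacityMb - (old == null ? 0L : old);
        clusterMb += delta;
        if (blacklisted.contains(host)) {
            blacklistedMb += delta;
        }
    }

    /** O(1) update when the application blacklists a node. */
    void blacklist(String host) {
        if (blacklisted.add(host)) {
            blacklistedMb += nodeCapacityMb.getOrDefault(host, 0L);
        }
    }

    /** O(1) update when the application removes a node from its blacklist. */
    void unblacklist(String host) {
        if (blacklisted.remove(host)) {
            blacklistedMb -= nodeCapacityMb.getOrDefault(host, 0L);
        }
    }

    /** Headroom without iterating the nodes; clamped so it is never negative. */
    long headroomMb(long usedMb) {
        return Math.max(0L, clusterMb - blacklistedMb - usedMb);
    }

    public static void main(String[] args) {
        BlacklistHeadroomSketch s = new BlacklistHeadroomSketch();
        for (int i = 1; i <= 4; i++) {
            s.addNode("NM-" + i, 8 * 1024);
        }
        long usedMb = 29 * 1024;
        System.out.println(s.headroomMb(usedMb)); // 3072
        s.blacklist("NM-4");
        System.out.println(s.headroomMb(usedMb)); // 0
    }
}
```

With the numbers from the issue description (4 x 8GB nodes, 29GB in use), the 
sketch reports 3GB of headroom before NM-4 is blacklisted and 0 afterwards.  
The result is clamped at 0 because some of the 29GB must sit on NM-4 itself, 
which this simplified version does not account for.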

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
> Cluster slow start is set to 1.
> A job's running reducer tasks occupy 29GB of the cluster. One NodeManager 
> (NM-4) became unstable (3 maps got killed), so MRAppMaster blacklisted the 
> unstable NodeManager (NM-4). All reducer tasks are now running in the cluster.
> MRAppMaster does not preempt the reducers because, for the reducer 
> preemption calculation, headRoom includes the blacklisted node's memory. 
> This makes jobs hang forever (ResourceManager does not assign any new 
> containers on blacklisted nodes, but the availableResources it returns 
> still counts all of the cluster's free memory).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
