[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157286#comment-14157286 ]

Craig Welch commented on YARN-1680:
-----------------------------------

This does bring up what I think could be an issue. I'm not sure if it was what 
you were getting at before or not, [~john.jian.fang], but we could well be 
introducing a new bug here unless we are careful.  I don't see any connection 
between the scheduler-level resource adjustments and the application-level 
adjustments, so if an application had problems with a node and blacklisted it, 
and then the cluster did as well, the resource value of the node would 
effectively be removed from the headroom twice (once when the application adds 
it to its new "blacklist reduction", and a second time when the cluster removes 
its value from the clusterResource).  I think this could be a problem, and I 
think it could be addressed, but it's something to think about, and I don't 
think the current approach addresses it.  [~airbots], [~jlowe], thoughts?
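
To make the concern concrete, here is a minimal sketch of how the same node's 
capacity could end up subtracted from the headroom twice. It assumes the 
Resources helper from hadoop-yarn-common; the class and variable names below 
are illustrative only, not the actual scheduler code:

    // Minimal sketch of the double-counting concern; illustrative names only.
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.util.resource.Resources;

    public class HeadroomDoubleCount {
      public static void main(String[] args) {
        Resource clusterResource = Resources.createResource(32 * 1024); // 4 NMs x 8GB
        Resource blacklistedNode = Resources.createResource(8 * 1024);  // node the app blacklisted
        Resource used            = Resources.createResource(20 * 1024); // containers currently running

        // Application-level adjustment: the proposed "blacklist reduction"
        // subtracts the blacklisted node's capacity from the headroom.
        Resource headroom = Resources.subtract(
            Resources.subtract(clusterResource, used), blacklistedNode); // 4GB

        // Cluster-level adjustment: if the RM later also removes the same node
        // from clusterResource (e.g. it is marked unusable) and the blacklist
        // reduction is not reconciled, the node is subtracted a second time.
        Resource adjustedCluster = Resources.subtract(clusterResource, blacklistedNode); // 24GB
        Resource doubleCounted   = Resources.subtract(
            Resources.subtract(adjustedCluster, used), blacklistedNode); // -4GB

        System.out.println("headroom with one subtraction: " + headroom);
        System.out.println("headroom with double counting: " + doubleCounted);
      }
    }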

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A job is running and its reducer tasks occupy 29GB of the cluster. One 
> NodeManager (NM-4) becomes unstable (3 maps got killed), so the MRAppMaster 
> blacklists the unstable NodeManager (NM-4). All reducer tasks are now running 
> in the cluster.
> The MRAppMaster does not preempt the reducers because, for the reducer 
> preemption calculation, the headroom includes the blacklisted node's memory. 
> This makes the job hang forever (the ResourceManager does not assign any new 
> containers on blacklisted nodes, but the availableResource it returns is based 
> on the whole cluster's free memory).
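
Reading the scenario above with rough numbers (the exact split of free memory 
across nodes is not given in the report, so this is only illustrative):

    total cluster capacity                       = 4 NMs x 8GB = 32GB
    memory held by running reducers              = 29GB
    headroom reported to the MRAppMaster         = 32GB - 29GB = 3GB
    headroom actually usable by this application = 3GB minus whatever free
                                                   memory sits on blacklisted NM-4

Because the reported 3GB includes free memory on NM-4, which the 
ResourceManager will never allocate to this application, the MRAppMaster 
believes it still has room for the failed maps and never preempts a reducer, 
so the job hangs.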



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
