[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732396#comment-14732396
 ] 

Karthik Kambatla commented on YARN-1680:
----------------------------------------

MAPREDUCE-6302 would resolve deadlocks, but only reactively. I believe we 
should use it only as a safeguard against future headroom issues.

We should fix this regardless. I went through the discussions here. I vote for 
the scheduler accounting for blacklisted nodes in the headroom calculation. 
If the app had to subtract these resources from the headroom itself, it might 
as well maintain the blacklist itself too and relieve the scheduler of those 
details. Also, as Jian mentioned, it is better to do this in one place (the 
scheduler) than to have each app handle it. 
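To illustrate the idea (a hypothetical sketch only; `computeHeadroomMB` and its parameters are illustrative names, not actual CapacityScheduler code):

```java
// Hypothetical sketch: subtract blacklisted-node capacity from the cluster
// total before computing headroom, so the AM never sees resources it cannot
// actually be allocated. Method and parameter names are illustrative.
public class HeadroomSketch {

    static long computeHeadroomMB(long clusterMemMB, long usedMemMB,
                                  long blacklistedMemMB) {
        // Capacity the app can really use excludes blacklisted nodes.
        long usable = clusterMemMB - blacklistedMemMB;
        // Clamp at zero: headroom is never negative.
        return Math.max(0, usable - usedMemMB);
    }

    public static void main(String[] args) {
        // Scenario from this JIRA: 4 NMs x 8GB = 32GB cluster, reducers
        // occupy 29GB, and one 8GB NM is blacklisted.
        long naive = Math.max(0, 32768 - 29696);             // headroom reported today
        long adjusted = computeHeadroomMB(32768, 29696, 8192); // headroom after the fix
        System.out.println(naive + " " + adjusted);
    }
}
```

With the adjustment, the headroom drops to 0 instead of 3GB, so the MRAppMaster's reducer-preemption logic would correctly see that no more resources are coming and preempt reducers to let the pending maps run.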

[~vinodkv] - do you agree? 

Accordingly, we would like to make progress on YARN-3446. 

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith Sharma K S
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each; total cluster capacity is 32GB. 
> Cluster slow start is set to 1.
> A job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
> became unstable (3 map tasks got killed), so the MRAppMaster blacklisted 
> it. All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because the headroom used in 
> the reducer-preemption calculation still includes the blacklisted node's 
> memory. This makes jobs hang forever: the ResourceManager does not assign 
> any new containers on blacklisted nodes, but the availableResources it 
> returns still counts those nodes' free memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
