[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531635#comment-14531635 ]
Craig Welch commented on YARN-1680:
-----------------------------------

[~leftnoteasy]
bq. Actually I think this statement may not true, assume we compute an accurate headroom for app, but that doesn't mean the app can get as much resource as we compute...you may not be able to get it after hours.

This would only occur if other applications were allocated those resources, in which case the headroom would drop and the application would be made aware of it via headroom updates. The scenario you propose as a counterexample is inaccurate. It is the case that accurate headroom (including a fix for the blacklist issue here) will result in faster overall job completion than the reactive approach of waiting for allocation failures.

[~vinodkv]
bq. OTOH, blacklisting / hard-locality are app-decisions. From the platform's perspective, those nodes, free or otherwise, are actually available for apps to use

Not quite so: the scheduler respects the blacklist and does not allocate containers to an app when doing so would run counter to the app's blacklisting.

That said, so far the discussion regarding the proposal has largely been about where the activity should live. Let's put that aside for a moment and concentrate on the approach itself. With API additions, additional library work, etc., it should be possible to do the same thing outside the scheduler as within it. Whether and what to do in or out of the scheduler still needs to be settled, of course, but a decision on how the headroom will be adjusted is needed in any case, and it is needed before putting together the change, wherever it ends up living. So: "where app headroom is finalized" == in the scheduler OR in a library available to / used by AMs.
If externalized, APIs to expose whatever information is not yet available outside the scheduler will obviously need to be added.

Proposal:
* Retain a node/rack blacklist where app headroom is finalized (already the case).
* Add a "last change" timestamp or incrementing counter to track node addition/removal at the cluster level (which is what exists for cluster black/white listing, afaict), updated when those events occur.
* Add a "last change" timestamp/counter where app headroom is finalized to track blacklist changes.
* Keep "last updated" values where app headroom is finalized to track the above two "last change" values, updated whenever the blacklist deduction is recalculated.
* On headroom calculation, where app headroom is finalized, check whether the app has any entries in its blacklist, or has a "blacklist deduction" value in its ResourceUsage entry (see below), to determine whether the blacklist must be taken into account.
* If the blacklist must be taken into account, check the "last updated" values for both cluster and app blacklist changes; if and only if either is stale (last updated != last change), recalculate the blacklist deduction.
* When calculating the blacklist deduction, use Chen He's basic logic from the existing patches. Place the deduction value into the ResourceUsage entry where app headroom is finalized.
* NodeLabels could be taken into account as well: if a node label expression is in play, only blacklist entries matching the expression used by the application would be added to the deduction.
* Whenever the headroom is generated where app headroom is finalized, apply the blacklist deduction.

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
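The counter-based staleness check and lazy deduction recalculation above can be sketched roughly as follows. This is a minimal, hypothetical illustration: the class, field, and method names are invented for this sketch (they are not actual YARN/CapacityScheduler identifiers), and memory is modeled as plain MB integers rather than YARN `Resource` objects.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BlacklistHeadroomSketch {

    // "Last change" counters: bumped on cluster node add/remove and on
    // app blacklist updates, respectively (names are illustrative).
    static long clusterChangeCounter = 0;
    static long appBlacklistChangeCounter = 0;

    // "Last updated" snapshots taken when the deduction was last recalculated;
    // -1 means the deduction has never been computed.
    static long seenClusterCounter = -1;
    static long seenBlacklistCounter = -1;

    static Set<String> appBlacklist = new HashSet<>();
    static Map<String, Integer> nodeFreeMb = new HashMap<>();

    // Stands in for the "blacklist deduction" stored in the app's
    // ResourceUsage entry.
    static int blacklistDeductionMb = 0;

    static int headroomMb(int rawHeadroomMb) {
        // No blacklist entries and no stored deduction: nothing to do.
        if (appBlacklist.isEmpty() && blacklistDeductionMb == 0) {
            return rawHeadroomMb;
        }
        // Recalculate the deduction only if either counter is stale.
        if (seenClusterCounter != clusterChangeCounter
                || seenBlacklistCounter != appBlacklistChangeCounter) {
            int deduction = 0;
            for (String node : appBlacklist) {
                deduction += nodeFreeMb.getOrDefault(node, 0);
            }
            blacklistDeductionMb = deduction;
            seenClusterCounter = clusterChangeCounter;
            seenBlacklistCounter = appBlacklistChangeCounter;
        }
        return Math.max(0, rawHeadroomMb - blacklistDeductionMb);
    }

    public static void main(String[] args) {
        nodeFreeMb.put("nm-4", 8192);   // blacklisted node with 8 GB free
        appBlacklist.add("nm-4");
        appBlacklistChangeCounter++;
        System.out.println(headroomMb(11264)); // 11264 - 8192 = 3072
    }
}
```

Subsequent calls with unchanged counters skip the loop over the blacklist entirely, which is the point of the "last change" / "last updated" pairing: the per-node work happens only when the cluster or the app's blacklist has actually changed.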
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation, the headroom includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that reflects the whole cluster's free memory).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
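Plugging the reported numbers in makes the hang concrete. A small sketch, assuming purely for illustration that all of the cluster's free memory sits on the blacklisted NM-4 (the helper `correctedHeadroomMb` is hypothetical, not a YARN API):

```java
public class HeadroomExample {
    // Headroom with the blacklisted node's free memory excluded,
    // clamped at zero (hypothetical helper, not a YARN API).
    static int correctedHeadroomMb(int rawHeadroomMb, int blacklistedFreeMb) {
        return Math.max(0, rawHeadroomMb - blacklistedFreeMb);
    }

    public static void main(String[] args) {
        int clusterMb = 4 * 8192;               // 4 NodeManagers x 8 GB
        int usedMb = 29 * 1024;                 // reducers occupy 29 GB
        int rawHeadroomMb = clusterMb - usedMb; // 3072 MB: what the AM sees today
        int nm4FreeMb = rawHeadroomMb;          // assumed to all sit on NM-4
        // With the deduction the AM sees 0 MB headroom, concludes no map task
        // can ever be placed, and preempts a reducer instead of hanging.
        System.out.println(correctedHeadroomMb(rawHeadroomMb, nm4FreeMb));
    }
}
```

With today's behavior the AM is told 3072 MB is available, waits for a container that can only be placed on the node it has blacklisted, and never preempts; with the deduction the headroom reads 0 and the reducer-preemption logic fires.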