[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139605#comment-14139605 ]
Jian He commented on YARN-1857:
-------------------------------

Thanks [~airbots] and [~cwelch], the patch looks good overall. A few comments and questions:
- The indentation of the last line seems incorrect.
{code}
Resource headroom =
    Resources.min(resourceCalculator, clusterResource,
        Resources.subtract(
            Resources.min(resourceCalculator, clusterResource,
                userLimit, queueMaxCap),
            userConsumed),
        Resources.subtract(queueMaxCap, usedResources));
{code}
- Test case 2: could you check app2's headroom as well?
- Test case 3: could you check app_1's headroom as well?
- Could you explain why, in test case 4, {{assertEquals(5*GB, app_4.getHeadroom().getMemory());}} holds, i.e. why app_4 still has 5GB of headroom?

> CapacityScheduler headroom doesn't account for other AM's running
> -----------------------------------------------------------------
>
>                 Key: YARN-1857
>                 URL: https://issues.apache.org/jira/browse/YARN-1857
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Assignee: Chen He
>            Priority: Critical
>         Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch,
>                      YARN-1857.patch, YARN-1857.patch, YARN-1857.patch
>
>
> It's possible to get an application to hang forever (or for a long time) in a
> cluster with multiple users. The reason is that the headroom sent to the
> application is based on the user limit, but it doesn't account for other
> application masters using space in that queue. So the headroom (user limit -
> user consumed) can be > 0 even though the cluster is 100% full, because the
> remaining space is being used by application masters from other users.
> For instance, suppose you have a cluster with 1 queue, the user limit is 100%,
> and multiple users are submitting applications. One very large application by
> user 1 starts up, runs most of its maps and starts running reducers. Other
> users try to start applications and get their application masters started but
> not tasks.
> The very large application then gets to the point where it has
> consumed the rest of the cluster resources with reduces. But at this
> point it still needs to finish a few maps. The headroom being sent to this
> application is based only on the user limit (which is 100% of the cluster
> capacity). It is using, let's say, 95% of the cluster for reduces, and the
> other 5% is being used by other users running application masters. The
> MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a
> reduce in order to run a map.
> This can happen in other scenarios also. Generally, in a large cluster with
> multiple queues, this shouldn't cause a hang forever, but it could cause the
> application to take much longer.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
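
To make the arithmetic in the issue description concrete, here is a minimal sketch of the 95% / 5% scenario. It is not the patch itself: it assumes resources can be modelled as plain percentages of the cluster (a `long`) rather than YARN `Resource` objects, and the class and method names (`HeadroomSketch`, `userLimitOnlyHeadroom`, `queueAwareHeadroom`) are hypothetical. The second method only mirrors the shape of the expression quoted in the review comment above.

```java
// Hypothetical illustration only; values are percentages of cluster capacity.
class HeadroomSketch {

    // Headroom as described in the issue: user limit minus what the user consumed.
    static long userLimitOnlyHeadroom(long userLimit, long userConsumed) {
        return userLimit - userConsumed;
    }

    // Headroom shaped like the snippet under review:
    // min(min(userLimit, queueMaxCap) - userConsumed, queueMaxCap - usedResources)
    static long queueAwareHeadroom(long userLimit, long queueMaxCap,
                                   long userConsumed, long usedResources) {
        long userHeadroom = Math.min(userLimit, queueMaxCap) - userConsumed;
        long queueHeadroom = queueMaxCap - usedResources;
        return Math.min(userHeadroom, queueHeadroom);
    }

    public static void main(String[] args) {
        long userLimit = 100;     // user limit is 100% of the cluster
        long queueMaxCap = 100;   // single queue spanning the whole cluster
        long userConsumed = 95;   // user 1's reduces
        long usedResources = 100; // 95% reduces + 5% other users' application masters

        // Prints 5: the application believes it still has room.
        System.out.println(userLimitOnlyHeadroom(userLimit, userConsumed));

        // Prints 0: the queue is full, so there is no headroom left.
        System.out.println(queueAwareHeadroom(userLimit, queueMaxCap,
                                              userConsumed, usedResources));
    }
}
```

With the user-limit-only calculation the application is told it has 5% left, which is the situation in which the MRAppMaster does not kill a reduce; once the queue's total usage is taken into account, the reported headroom drops to 0.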