[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163591#comment-14163591 ]

Hudson commented on YARN-1857:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1920 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1920/])
YARN-1857. CapacityScheduler headroom doesn't account for other AM's running. Contributed by Chen He and Craig Welch (jianhe: rev 30d56fdbb40d06c4e267d6c314c8c767a7adc6a3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java

> CapacityScheduler headroom doesn't account for other AM's running
> ------------------------------------------------------------------
>
>                 Key: YARN-1857
>                 URL: https://issues.apache.org/jira/browse/YARN-1857
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Assignee: Chen He
>            Priority: Critical
>             Fix For: 2.6.0
>
>         Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.7.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch
>
> It's possible for an application to hang forever (or for a very long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users.
> For instance, take a cluster with one queue, a user limit of 100%, and multiple users submitting applications. One very large application from user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then reaches the point where it has consumed the rest of the cluster resources with reducers, but it still needs to finish a few maps. The headroom sent to this application is based only on the user limit (which is 100% of the cluster capacity): it is using, say, 95% of the cluster for reducers, and the other 5% is being used by other users' application masters. The MRAppMaster thinks it still has 5% headroom, so it doesn't know that it should kill a reducer in order to run a map.
> This can happen in other scenarios as well. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could make the application take much longer.
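To make the failure mode concrete, here is a minimal, self-contained sketch with made-up numbers (not code from the patch) showing how a headroom of {{userLimit - userConsumed}} can report free space even though the cluster is full:

{code}
// Hypothetical numbers (one 100 GB cluster, one queue, user limit = 100%)
// illustrating the pre-fix headroom calculation; this is not patch code.
public class OldHeadroomSketch {
  public static void main(String[] args) {
    int clusterCapacity = 100; // GB of total cluster resources
    int userLimit       = 100; // GB, user limit = 100% of the queue
    int userConsumed    = 95;  // GB used by user 1's reducers
    int otherAMs        = 5;   // GB used by other users' ApplicationMasters

    // Headroom as reported before this fix: user limit minus user consumption.
    int reportedHeadroom = userLimit - userConsumed;              // 5 GB
    // What is actually still free in the cluster.
    int actuallyFree = clusterCapacity - userConsumed - otherAMs; // 0 GB

    System.out.println("reported headroom = " + reportedHeadroom + " GB");
    System.out.println("actually free     = " + actuallyFree + " GB");
  }
}
{code}

With 5 GB of reported headroom the MRAppMaster sees no reason to preempt a reducer, so the last maps never get to run - the hang described above.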
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163513#comment-14163513 ]

Hudson commented on YARN-1857:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1895 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1895/])
YARN-1857. CapacityScheduler headroom doesn't account for other AM's running. Contributed by Chen He and Craig Welch (jianhe: rev 30d56fdbb40d06c4e267d6c314c8c767a7adc6a3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163360#comment-14163360 ]

Hudson commented on YARN-1857:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #705 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/705/])
YARN-1857. CapacityScheduler headroom doesn't account for other AM's running. Contributed by Chen He and Craig Welch (jianhe: rev 30d56fdbb40d06c4e267d6c314c8c767a7adc6a3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162510#comment-14162510 ]

Hudson commented on YARN-1857:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6206 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6206/])
YARN-1857. CapacityScheduler headroom doesn't account for other AM's running. Contributed by Chen He and Craig Welch (jianhe: rev 30d56fdbb40d06c4e267d6c314c8c767a7adc6a3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162497#comment-14162497 ]

Chen He commented on YARN-1857:
-------------------------------

Sorry, my bad, looks like YARN-2400 is checked in. Anyway, it is not related to this patch.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162496#comment-14162496 ]

Chen He commented on YARN-1857:
-------------------------------

The unit test failure is caused by YARN-2400.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162487#comment-14162487 ]

Jian He commented on YARN-1857:
-------------------------------

Craig, thanks for updating. Looks good, +1.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162461#comment-14162461 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673410/YARN-1857.7.patch
against trunk revision 9196db9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
    org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5311//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5311//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162357#comment-14162357 ]

Chen He commented on YARN-1857:
-------------------------------

Hi [~cwelch], hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/CapacityScheduler.apt.vm has detailed information about these parameters.
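For readers without the source tree handy, the parameters in question (queue capacity, queue maximum capacity, and the per-user limit) are set in capacity-scheduler.xml. A minimal illustrative fragment - the queue name {{default}} and the values are examples only, not taken from this JIRA - might look like:

{code}
<!-- Illustrative capacity-scheduler.xml fragment; the queue name "default"
     and the values are examples only, not recommendations. -->
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.minimum-user-limit-percent</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>1</value>
</property>
{code}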
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162170#comment-14162170 ]

Craig Welch commented on YARN-1857:
-----------------------------------

This is an interesting question - that logic predates this change, and I wondered whether there were cases where userLimit could somehow be > queueMaxCap. Looking at the code, surprisingly, I believe so. userLimit is calculated based on absolute queue values, whereas, at least since [YARN-2008], queueMaxCap takes into account actual usage in other queues. So it is entirely possible for userLimit to be > queueMaxCap because of how the two are calculated, at least post [YARN-2008]. I'm not sure whether that was possible pre-YARN-2008 as well - it may have been, since there was already some subtlety in that calculation before that change - but in any event, it is the case now. So, as it happens, I don't believe we can do the simplification.
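A made-up numeric example (not taken from the patch or from YARN-2008) of the situation described here, where userLimit ends up larger than queueMaxCap and the clamp in the new headroom formula matters:

{code}
// Illustrative only: userLimit derived from the queue's absolute capacity can
// exceed queueMaxCap once queueMaxCap reflects usage in other queues.
public class UserLimitVsQueueMaxCap {
  public static void main(String[] args) {
    int userLimit    = 60; // GB, based on the queue's configured capacity
    int queueMaxCap  = 40; // GB, shrunk because other queues are busy
    int userConsumed = 30; // GB running for this user
    int queueUsed    = 35; // GB running in this queue overall

    // Headroom with the clamp used in the patch under discussion:
    int headroom = Math.min(
        Math.min(userLimit, queueMaxCap) - userConsumed, // 40 - 30 = 10
        queueMaxCap - queueUsed);                        // 40 - 35 = 5
    System.out.println("clamped headroom   = " + headroom + " GB");

    // userLimit - userConsumed alone would claim far more room than exists.
    System.out.println("userLimit headroom = "
        + (userLimit - userConsumed) + " GB");           // 30 GB
  }
}
{code}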
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161496#comment-14161496 ]

Jian He commented on YARN-1857:
-------------------------------

I found that, given {{queueUsedResources >= userConsumed}}, we can simplify the formula to
{code}
min(userLimit - userConsumed, queueMaxCap - queueUsedResources)
{code}
Does this make sense?
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161373#comment-14161373 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673238/YARN-1857.6.patch
against trunk revision 519e5a7.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
    org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5296//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5296//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161300#comment-14161300 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673238/YARN-1857.6.patch
against trunk revision 519e5a7.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
    org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5292//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5292//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161220#comment-14161220 ]

Craig Welch commented on YARN-1857:
-----------------------------------

[~john.jian.fang] - uploaded .6 on [YARN-2644], updated headroom calculation comment, fixed indentation.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161164#comment-14161164 ]

Jian He commented on YARN-1857:
-------------------------------

Could you please update the patch on top of YARN-2644? Comments in the meantime:
- Update the code comment about the new calculation of headroom:
{code}
  /**
   * Headroom is min((userLimit, queue-max-cap) - consumed)
   */
{code}
- Fix the indentation of this line: {{Resources.subtract(queueMaxCap, usedResources));}}
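Put together, the updated comment and call might look roughly like this - a sketch assembled from the snippets quoted in this thread, not the exact text of the .6 patch:

{code}
// Sketch only: combines the comment and calculation quoted in this JIRA;
// the committed LeafQueue.java may differ in wording and formatting.
/**
 * Headroom is:
 *   min(min(userLimit, queueMaxCap) - userConsumed,
 *       queueMaxCap - queueUsedResources)
 */
Resource headroom =
    Resources.min(resourceCalculator, clusterResource,
        Resources.subtract(
            Resources.min(resourceCalculator, clusterResource,
                userLimit, queueMaxCap),
            userConsumed),
        Resources.subtract(queueMaxCap, usedResources));
{code}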
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160819#comment-14160819 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673170/YARN-1857.5.patch
against trunk revision ea26cc0.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5279//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158644#comment-14158644 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672863/YARN-1857.4.patch
against trunk revision 7f6ed7f.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5256//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139605#comment-14139605 ]

Jian He commented on YARN-1857:
-------------------------------

Thanks [~airbots] and [~cwelch], the patch looks good overall. A few comments and questions:
- The indentation of the last line seems incorrect:
{code}
Resource headroom =
  Resources.min(resourceCalculator, clusterResource,
    Resources.subtract(
      Resources.min(resourceCalculator, clusterResource, userLimit, queueMaxCap),
      userConsumed),
        Resources.subtract(queueMaxCap, usedResources));
{code}
- Test case 2: could you check app_2's headroom as well?
- Test case 3: could you check app_1's headroom as well?
- Could you explain why, in test case 4, {{assertEquals(5*GB, app_4.getHeadroom().getMemory());}} holds - i.e., why app_4 still has 5GB of headroom?
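Plugging the issue description's hypothetical 95%/5% scenario into this formula (illustrative numbers only, not from the tests) shows why the corrected headroom no longer misleads the MRAppMaster:

{code}
// Illustrative numbers: 100 GB cluster, one queue at 100% user limit,
// 95 GB of reducers for user 1 plus 5 GB of other users' AMs.
public class CorrectedHeadroomExample {
  public static void main(String[] args) {
    int userLimit    = 100;
    int queueMaxCap  = 100;
    int userConsumed = 95;  // user 1's reducers
    int queueUsed    = 100; // reducers plus the other users' AMs

    int headroom = Math.min(
        Math.min(userLimit, queueMaxCap) - userConsumed, // 100 - 95 = 5
        queueMaxCap - queueUsed);                        // 100 - 100 = 0

    // headroom == 0, so the AM knows it must preempt a reducer to run a map.
    System.out.println("corrected headroom = " + headroom + " GB");
  }
}
{code}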
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136807#comment-14136807 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12669331/YARN-1857.3.patch
against trunk revision 0e7d1db.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4984//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4984//console

This message is automatically generated.
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132401#comment-14132401 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12668487/YARN-1857.2.patch
against trunk revision a0ad975.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1265 javac compiler warnings (more than the trunk's current 1264 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:
    org.apache.hadoop.hdfs.TestDecommission
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4946//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4946//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4946//console

This message is automatically generated.
Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114431#comment-14114431 ] Chen He commented on YARN-1857: --- Sure, it has been a while since I created this patch for the first time. Let me make the updates. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He >Priority: Critical > Attachments: YARN-1857.1.patch, YARN-1857.patch, YARN-1857.patch, > YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114418#comment-14114418 ] Jian He commented on YARN-1857: --- Hi [~airbots], thanks for working on this, Can you add more comments in the test about how the numbers are calculated ? it's not easy to follow. And maybe rename LeafQueue a to b, as it is getting queueB. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He >Priority: Critical > Attachments: YARN-1857.1.patch, YARN-1857.patch, YARN-1857.patch, > YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
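The review comment above asks for the test arithmetic to be spelled out. The snippet below is a hypothetical illustration of that style of self-documenting assertion; the queue sizes, user allocations, and the GB constant are invented here and do not come from the actual TestLeafQueue patch.

```java
// Hypothetical example of arithmetic-commented test assertions; all numbers are made up.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class HeadroomArithmeticExample {
    private static final int GB = 1024; // MB

    @Test
    public void headroomAccountsForOtherUsersAMs() {
        // Queue "b": 8 GB capacity. user_0 holds an AM (2 GB) + one task (2 GB) = 4 GB.
        // user_1 holds only its AM (2 GB), so the queue has 6 GB used and 2 GB free.
        int userLimit = 8 * GB, userConsumed = 4 * GB;
        int queueCapacity = 8 * GB, queueUsed = 6 * GB;
        int headroom = Math.min(userLimit - userConsumed, queueCapacity - queueUsed);
        // The old formula would report 4 GB; accounting for the other AM leaves 2 GB.
        assertEquals(2 * GB, headroom);
    }
}
```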
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109798#comment-14109798 ] Craig Welch commented on YARN-1857: --- [~jianhe] [~wangda] could you have a look at this patch? > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He >Priority: Critical > Attachments: YARN-1857.1.patch, YARN-1857.patch, YARN-1857.patch, > YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055348#comment-14055348 ] Hadoop QA commented on YARN-1857: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654635/YARN-1857.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4223//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4223//console This message is automatically generated. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He >Priority: Critical > Attachments: YARN-1857.patch, YARN-1857.patch, YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027943#comment-14027943 ] Jonathan Eagles commented on YARN-1857: --- Bumping the priority since reducer preemption is broken in many cases without this fix. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He >Priority: Critical > Attachments: YARN-1857.patch, YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
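The comment above points out that reducer preemption in the MapReduce AM depends on the headroom value. The sketch below shows the shape of that decision and why an inflated headroom keeps preemption from firing; it is a simplified assumption-laden illustration, not the actual RMContainerAllocator logic.

```java
// Simplified sketch of why an inflated headroom breaks reducer preemption; this is not
// the real MRAppMaster code, only the shape of the decision it has to make.
final class ReducePreemptionSketch {

    /** Decide whether running reducers must be preempted so pending maps can run. */
    static boolean shouldPreemptReducers(int pendingMaps, int mapResourceMb,
                                         int headroomMb, int runningReducers) {
        if (pendingMaps == 0 || runningReducers == 0) {
            return false;
        }
        // If the RM reports enough headroom, the AM simply waits instead of preempting.
        // With a headroom that ignores other users' AMs, this stays true even on a
        // completely full cluster, so the job hangs with maps stuck behind reducers.
        boolean headroomCoversMaps = headroomMb >= pendingMaps * mapResourceMb;
        return !headroomCoversMaps;
    }

    public static void main(String[] args) {
        // Full cluster, but 5 GB of "headroom" that is really held by other users' AMs:
        System.out.println(shouldPreemptReducers(2, 2048, 5 * 1024, 40)); // false -> hang
        // With the corrected headroom of 0, the AM preempts a reducer to run the maps:
        System.out.println(shouldPreemptReducers(2, 2048, 0, 40));        // true
    }
}
```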
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990304#comment-13990304 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643410/YARN-1857.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3697//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3697//console This message is automatically generated. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch, YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989865#comment-13989865 ] Chen He commented on YARN-1857: --- This failure is related to YARN-1906. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988124#comment-13988124 ] Chen He commented on YARN-1857: --- The TestRMRestart successfully passed on my laptop. I think this failure may not be related to my patch. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988096#comment-13988096 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643084/YARN-1857.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3682//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3682//console This message is automatically generated. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch > > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940984#comment-13940984 ] Vinod Kumar Vavilapalli commented on YARN-1857: --- This is just one of the items tracked at YARN-1198. Will convert it as a sub-task. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves > > Its possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason why is that the headroom sent to the > application is based on the user limit but it doesn't account for other > Application masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full because the > other space is being used by application masters from other users. > For instance if you have a cluster with 1 queue, user limit is 100%, you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it needs to still finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity) its using lets say 95% of the cluster for reduces and then other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5% so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)