[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-10-07 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--
Attachment: YARN-1857.7.patch


I see it now - uploading newer patch with simplified final headroom calculation

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, 
 YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.7.patch, 
 YARN-1857.patch, YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-10-06 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--
Attachment: YARN-1857.5.patch

Updating to current trunk on new(er) repo

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, 
 YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.patch, YARN-1857.patch, 
 YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-10-03 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--
Attachment: YARN-1857.4.patch

[~john.jian.fang] I've attached a .4 patch with the additional checks you 
requested in the test.  The test also has comments for the headroom calculation 
to make it clear why the values are what they are (hopefully :-) ) - including 
the one you had specifically called out as looking off.  I also added a call to 
calculate headroom for an additional app in the final test because without that 
it still just had it's prior (stale) values.  The only changes are to the test, 
and it is the same as the one I just uploaded on [YARN-1198]

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, 
 YARN-1857.4.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-09-16 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--
Attachment: YARN-1857.3.patch

re attaching patch to get the javac warnings

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, 
 YARN-1857.patch, YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-09-12 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--
Attachment: YARN-1857.2.patch

I went ahead and put in a comment wrt a headroom value in the test and changed 
the queue variable to qb.  [~airbots], [~jianhe], can you take a look at 
patch .2?

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.patch, 
 YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-08-25 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1857:
--

Target Version/s: 2.6.0  (was: 2.4.1)

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.patch, YARN-1857.patch, 
 YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-08-14 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1857:
--

Attachment: YARN-1857.1.patch

Just updating to a patch which applies against current trunk, otherwise 
unchanged

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.1.patch, YARN-1857.patch, YARN-1857.patch, 
 YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-07-08 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1857:
--

Attachment: YARN-1857.patch

trigger QA

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.patch, YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-06-11 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-1857:
--

Priority: Critical  (was: Major)

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
Priority: Critical
 Attachments: YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-06-11 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-1857:
--

Target Version/s: 2.4.1  (was: 2.4.0)

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
 Attachments: YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-05-05 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1857:
--

Attachment: YARN-1857.patch

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
 Attachments: YARN-1857.patch, YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-05-02 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1857:
--

Attachment: YARN-1857.patch

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Chen He
 Attachments: YARN-1857.patch


 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-03-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1857:


Description: 
Its possible to get an application to hang forever (or a long time) in a 
cluster with multiple users.  The reason why is that the headroom sent to the 
application is based on the user limit but it doesn't account for other 
Application masters using space in that queue.  So the headroom (user limit - 
user consumed) can be  0 even though the cluster is 100% full because the 
other space is being used by application masters from other users.  

For instance if you have a cluster with 1 queue, user limit is 100%, you have 
multiple users submitting applications.  One very large application by user 1 
starts up, runs most of its maps and starts running reducers. other users try 
to start applications and get their application masters started but not tasks.  
The very large application then gets to the point where it has consumed the 
rest of the cluster resources with all reduces.  But at this point it needs to 
still finish a few maps.  The headroom being sent to this application is only 
based on the user limit (which is 100% of the cluster capacity) its using lets 
say 95% of the cluster for reduces and then other 5% is being used by other 
users running application masters.  The MRAppMaster thinks it still has 5% so 
it doesn't know that it should kill a reduce in order to run a map.  

This can happen in other scenarios also.  Generally in a large cluster with 
multiple queues this shouldn't cause a hang forever but it could cause the 
application to take much longer.

  was:
Its possible to get an application to hang forever (or a long time) in a 
cluster with multiple users.  The reason why is that the headroom sent to the 
application is based on the user limit but it doesn't account for other 
Application masters using space in that queue.  So the headroom (user limit 
(100%) - user consumed) can be  0 even though the cluster is 100% full because 
the other space is being used by application masters from other users.  

For instance if you have a cluster with 1 queue, user limit is 100%, you have 
multiple users submitting applications.  One very large application by user 1 
starts up, runs most of its maps and starts running reducers. other users try 
to start applications and get their application masters started but not tasks.  
The very large application then gets to the point where it has consumed the 
rest of the cluster resources with all reduces.  But at this point it needs to 
still finish a few maps.  The headroom being sent to this application is only 
based on the user limit (which is 100% of the cluster capacity) its using lets 
say 95% of the cluster for reduces and then other 5% is being used by other 
users running application masters.  The MRAppMaster thinks it still has 5% so 
it doesn't know that it should kill a reduce in order to run a map.  

This can happen in other scenarios also.  Generally in a large cluster with 
multiple queues this shouldn't cause a hang forever but it could cause the 
application to take much longer.


 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves

 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  

[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-03-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1857:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-1198

 CapacityScheduler headroom doesn't account for other AM's running
 -

 Key: YARN-1857
 URL: https://issues.apache.org/jira/browse/YARN-1857
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves

 Its possible to get an application to hang forever (or a long time) in a 
 cluster with multiple users.  The reason why is that the headroom sent to the 
 application is based on the user limit but it doesn't account for other 
 Application masters using space in that queue.  So the headroom (user limit - 
 user consumed) can be  0 even though the cluster is 100% full because the 
 other space is being used by application masters from other users.  
 For instance if you have a cluster with 1 queue, user limit is 100%, you have 
 multiple users submitting applications.  One very large application by user 1 
 starts up, runs most of its maps and starts running reducers. other users try 
 to start applications and get their application masters started but not 
 tasks.  The very large application then gets to the point where it has 
 consumed the rest of the cluster resources with all reduces.  But at this 
 point it needs to still finish a few maps.  The headroom being sent to this 
 application is only based on the user limit (which is 100% of the cluster 
 capacity) its using lets say 95% of the cluster for reduces and then other 5% 
 is being used by other users running application masters.  The MRAppMaster 
 thinks it still has 5% so it doesn't know that it should kill a reduce in 
 order to run a map.  
 This can happen in other scenarios also.  Generally in a large cluster with 
 multiple queues this shouldn't cause a hang forever but it could cause the 
 application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)