[jira] [Commented] (YARN-6507) Add support in NodeManager to isolate FPGA devices with CGroups

2017-11-25 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265938#comment-16265938
 ] 

Zhankun Tang commented on YARN-6507:


[~wangda], the end-to-end test report is attached to YARN-5983. Please take a look.

> Add support in NodeManager to isolate FPGA devices with CGroups
> ---
>
> Key: YARN-6507
> URL: https://issues.apache.org/jira/browse/YARN-6507
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Attachments: YARN-6507-branch-YARN-3926.001.patch, 
> YARN-6507-branch-YARN-3926.002.patch, YARN-6507-trunk.001.patch, 
> YARN-6507-trunk.002.patch, YARN-6507-trunk.003.patch, 
> YARN-6507-trunk.004.patch, YARN-6507-trunk.005.patch, 
> YARN-6507-trunk.006.patch, YARN-6507-trunk.007.patch, 
> YARN-6507-trunk.008.patch, YARN-6507-trunk.009.patch
>
>
> Support a local FPGA resource scheduler that assigns and isolates N FPGA 
> slots for a container.
> Initially, support one vendor plugin with basic features to serve 
> OpenCL applications.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN

2017-11-25 Thread Zhankun Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-5983:
---
Attachment: YARN-5983_end-to-end_test_report.pdf

Added an end-to-end test report for your reference.

> [Umbrella] Support for FPGA as a Resource in YARN
> -
>
> Key: YARN-5983
> URL: https://issues.apache.org/jira/browse/YARN-5983
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf, 
> YARN-5983-implementation-notes.pdf, YARN-5983_end-to-end_test_report.pdf
>
>
> As more big data workloads run on YARN, CPU performance will eventually stop 
> scaling and heterogeneous systems will become more important. ML/DL has been 
> on the rise in recent years, and applications in these areas have to use GPUs 
> or FPGAs to boost performance. Hardware vendors such as Intel are also 
> investing in such hardware. FPGAs will most likely become as common in data 
> centers as CPUs in the near future.
> It would therefore be great for YARN, as a resource management and scheduling 
> system, to evolve to support this. This JIRA proposes making FPGA a 
> first-class citizen. The changes roughly include:
> 1. FPGA resource detection and heartbeat
> 2. Scheduler changes
> 3. FPGA-related preparation and isolation before container launch
> We know that YARN-3926 is working to extend the current resource model, but 
> FPGA-related discussion can still take place here.






[jira] [Assigned] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu reassigned YARN-7560:
--

Assignee: zhengchenyu

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.000.patch
>
>
> In our cluster, we changed the configuration and ran refreshQueues, after 
> which the ResourceManager hung and could not be restarted successfully. 
> The jstack output always showed the following:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> While debugging the cluster, we found that 
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop 
> never exits. In our cluster, the sum of all minRes values exceeds 
> Integer.MAX_VALUE, which is why resourceUsedWithWeightToResourceRatio returns 
> a negative value.
> The loop is shown below. totalResource is a long and is always positive, but 
> resourceUsedWithWeightToResourceRatio returns an int. Our cluster is large 
> enough that the return value overflows and becomes negative, so the loop 
> condition stays true and the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
> < totalResource) {
>   rMax *= 2.0;
> }
> {code}
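
The overflow described in the report can be reproduced in isolation. The following is a minimal sketch, not the actual ComputeFairShares code: the class name, the method names, the clamping rule, and the sample values are all assumptions chosen for illustration. It sums per-queue shares once into an int and once into a long, then runs the same doubling loop with an iteration cap so that the demo terminates.

```java
// Minimal sketch of the overflow described in the report. All names here are
// hypothetical stand-ins, not the actual ComputeFairShares implementation.
final class OverflowLoopSketch {

    // Sums per-queue shares into an int, as the report describes.
    // A total above Integer.MAX_VALUE silently wraps to a negative number.
    static int usedAsInt(double ratio, int[] minShares, int[] maxShares) {
        int total = 0;
        for (int i = 0; i < minShares.length; i++) {
            total += clampShare(ratio, minShares[i], maxShares[i]);
        }
        return total;
    }

    // Same computation, accumulated in a long so the total does not wrap
    // for these inputs.
    static long usedAsLong(double ratio, int[] minShares, int[] maxShares) {
        long total = 0L;
        for (int i = 0; i < minShares.length; i++) {
            total += clampShare(ratio, minShares[i], maxShares[i]);
        }
        return total;
    }

    // Assumed share rule: the ratio clamped between a queue's min and max share.
    static int clampShare(double ratio, int minShare, int maxShare) {
        return (int) Math.max(minShare, Math.min(maxShare, ratio));
    }

    // Runs the doubling loop from the report, capped so the demo always ends.
    // Returns the number of doublings performed, or cap if the loop never exited.
    static int doublings(boolean useLong, int[] minShares, int[] maxShares,
                         long totalResource, int cap) {
        double rMax = 1.0;
        int count = 0;
        while (count < cap) {
            long used = useLong
                ? usedAsLong(rMax, minShares, maxShares)
                : usedAsInt(rMax, minShares, maxShares);
            if (used >= totalResource) {
                break;
            }
            rMax *= 2.0;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two queues whose min shares together exceed Integer.MAX_VALUE.
        int[] minShares = {1_500_000_000, 1_500_000_000};
        int[] maxShares = {1_500_000_000, 1_500_000_000};
        long totalResource = 2_000_000_000L;

        System.out.println("int sum:  " + usedAsInt(1.0, minShares, maxShares));
        System.out.println("long sum: " + usedAsLong(1.0, minShares, maxShares));
        System.out.println("doublings with int sum (cap 64):  "
            + doublings(false, minShares, maxShares, totalResource, 64));
        System.out.println("doublings with long sum (cap 64): "
            + doublings(true, minShares, maxShares, totalResource, 64));
    }
}
```

With the int accumulator the sum stays negative no matter how large rMax grows, so only the cap ends the loop. With the long accumulator the condition is satisfied on the first check. Whether the attached patch takes this approach is not stated in this thread.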




[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Attachment: YARN-7560.000.patch

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.000.patch
>
>

[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Attachment: (was: YARN-7560.patch.00)

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
>

[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265922#comment-16265922
 ] 

Yufei Gu commented on YARN-7560:


[~zhengchenyu], I've added you as a contributor. You can assign this to 
yourself.

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.patch.00
>
>


[jira] [Comment Edited] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265921#comment-16265921
 ] 

Yufei Gu edited comment on YARN-7560 at 11/26/17 6:09 AM:
--

Thanks for filing this issue and providing the patch, [~zhengchenyu]. Can you 
rename your patch to something like YARN-7560.xxx.patch so that Hadoop QA can 
kick in?


was (Author: yufeigu):
Thanks for filing this issue and provide the patch, [~zhengchenyu]. Can you 
remove your patch to something like YARN-7560.xxx.patch, so that Hadoop QA can 
kick in. 

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.patch.00
>
>

[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265921#comment-16265921
 ] 

Yufei Gu commented on YARN-7560:


Thanks for filing this issue and provide the patch, [~zhengchenyu]. Can you 
remove your patch to something like YARN-7560.xxx.patch, so that Hadoop QA can 
kick in. 

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.patch.00
>
>

[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Attachment: YARN-7560.patch.00

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Fix For: 3.0.0
>
> Attachments: YARN-7560.patch.00
>
>
> In our cluster, we changed the configuration, then refreshQueues, we found 
> the resourcemanager hangs. And the Resourcemanager can't restart 
> successfully. We got jstack information, always show like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, the sum of all minRes values exceeds
> Integer.MAX_VALUE, so the int result overflows.
> The loop is shown below. totalResource is a long and is always positive, but
> resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big
> that the returned value overflows and becomes negative, so the loop
> condition stays true and the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
> < totalResource) {
>   rMax *= 2.0;
> }
> {code}




[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265918#comment-16265918
 ] 

zhengchenyu commented on YARN-7560:
---

[~yufeigu]
The sum of minRes across all queues exceeds Integer.MAX_VALUE.
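
To illustrate the overflow, here is a minimal sketch. The queue count and minRes values are hypothetical, not taken from the cluster, and the class and method names are made up for this example. It only shows that summing int values whose true total exceeds Integer.MAX_VALUE (2147483647) wraps around to a negative int, while the same sum held in a long does not.

```java
class MinResOverflow {
    // Sums the values with int arithmetic, as an int-returning method would.
    static int sumAsInt(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v; // silently wraps around past Integer.MAX_VALUE
        }
        return total;
    }

    // Sums the same values with long arithmetic, which does not overflow here.
    static long sumAsLong(int[] values) {
        long total = 0L;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical: three queues, each with a minRes of 1,000,000,000.
        int[] minRes = {1_000_000_000, 1_000_000_000, 1_000_000_000};
        System.out.println(sumAsInt(minRes));  // -1294967296
        System.out.println(sumAsLong(minRes)); // 3000000000
    }
}
```

With these assumed values the true total is 3000000000, which does not fit in an int, so the int sum comes out as 3000000000 - 4294967296 = -1294967296.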

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: zhengchenyu
> Fix For: 3.0.0
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output always looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, the sum of all minRes values exceeds
> Integer.MAX_VALUE, so the int result overflows.
> The loop is shown below. totalResource is a long and is always positive, but
> resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big
> that the returned value overflows and becomes negative, so the loop
> condition stays true and the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
> < totalResource) {
>   rMax *= 2.0;
> }
> {code}




[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Description: 
In our cluster, we changed the configuration and ran refreshQueues, and then
found that the ResourceManager hangs. The ResourceManager also cannot restart
successfully. The jstack output always looks like this:
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
[0x7f98eed9a000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
- locked <0x7f8c4a8177a0> (a java.util.HashMap)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
- locked <0x7f8c4a7eb2e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
- locked <0x7f8c4a76ac48> (a java.lang.Object)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
- locked <0x7f8c49254268> (a java.lang.Object)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
- locked <0x7f8c467495e0> (a java.lang.Object)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
{code}

When we debugged the cluster, we found that
resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
never exits. In our cluster, the sum of all minRes values exceeds
Integer.MAX_VALUE, so the int result overflows.

The loop is shown below. totalResource is a long and is always positive, but
resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big
that the returned value overflows and becomes negative, so the loop
condition stays true and the loop never breaks.
{code}
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
< totalResource) {
  rMax *= 2.0;
}
{code}
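
The hang described above can be reproduced in isolation with the sketch below. It is not the ComputeFairShares code: the clamp in share(), the min/max values, and all class and method names are assumptions chosen for illustration, and the loop is capped at maxSteps so the demo terminates. It shows that when the per-queue shares are summed into an int, the total stays negative no matter how large rMax grows, so the comparison against a positive long totalResource never fails. Accumulating into a long lets the loop exit.

```java
class FairShareLoopSketch {
    // Hypothetical per-queue share: the ratio clamped to [minRes, maxRes].
    static int share(double ratio, int minRes, int maxRes) {
        return (int) Math.min(Math.max(ratio, minRes), maxRes);
    }

    // Stand-in for an int-returning total: the accumulator can wrap around.
    static int usedAsInt(double ratio, int[] minRes, int[] maxRes) {
        int total = 0;
        for (int i = 0; i < minRes.length; i++) {
            total += share(ratio, minRes[i], maxRes[i]);
        }
        return total;
    }

    // Same total accumulated in a long, which cannot wrap for these inputs.
    static long usedAsLong(double ratio, int[] minRes, int[] maxRes) {
        long total = 0L;
        for (int i = 0; i < minRes.length; i++) {
            total += share(ratio, minRes[i], maxRes[i]);
        }
        return total;
    }

    // Doubling loop on the int total. Returns the number of doublings,
    // or -1 if the loop condition is still true after maxSteps doublings.
    static int doublingsWithInt(int[] minRes, int[] maxRes,
                                long totalResource, int maxSteps) {
        double rMax = 1.0;
        for (int steps = 0; steps <= maxSteps; steps++) {
            if (usedAsInt(rMax, minRes, maxRes) >= totalResource) {
                return steps;
            }
            rMax *= 2.0;
        }
        return -1;
    }

    // Same doubling loop on the long total.
    static int doublingsWithLong(int[] minRes, int[] maxRes,
                                 long totalResource, int maxSteps) {
        double rMax = 1.0;
        for (int steps = 0; steps <= maxSteps; steps++) {
            if (usedAsLong(rMax, minRes, maxRes) >= totalResource) {
                return steps;
            }
            rMax *= 2.0;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical values: three queues whose minRes sum exceeds int range.
        int[] minRes = {1_000_000_000, 1_000_000_000, 1_000_000_000};
        int[] maxRes = {1_200_000_000, 1_200_000_000, 1_200_000_000};
        long totalResource = 2_000_000_000L;
        System.out.println(doublingsWithInt(minRes, maxRes, totalResource, 64));  // -1
        System.out.println(doublingsWithLong(minRes, maxRes, totalResource, 64)); // 0
    }
}
```

In this sketch the int total is -1294967296 at the start and remains negative for every rMax, because each share is capped at its maxRes and three such shares still overflow. The long total is 3000000000 on the first check, which already reaches totalResource.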



  was:
In our cluster, we changed the configuration and ran refreshQueues, and then
found that the ResourceManager hangs. The ResourceManager also cannot restart
successfully. The jstack output looks like this:
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
[0x7f98eed9a000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
at 

[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Target Version/s: 3.0.0  (was: 2.7.5)

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: zhengchenyu
> Fix For: 3.0.0
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, all minRes is over int.max.






[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Affects Version/s: (was: 2.7.1)

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: zhengchenyu
> Fix For: 3.0.0
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, all minRes is over int.max.






[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7560:
--
Fix Version/s: (was: 2.7.5)
   3.0.0

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: zhengchenyu
> Fix For: 3.0.0
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, all minRes is over int.max.






[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265910#comment-16265910
 ] 

Yufei Gu commented on YARN-7560:


Which version are you using, 2.7.1 or 3.0? What do you mean by "all minRes is
over int.max"? Did you set it that way intentionally?

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.7.1, 3.0.0
>Reporter: zhengchenyu
> Fix For: 2.7.5
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop
> never exits. In our cluster, all minRes is over int.max.






[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-7560:
---
Component/s: fairscheduler

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.7.1, 3.0.0
>Reporter: zhengchenyu
> Fix For: 2.7.5
>
>
> In our cluster, we changed the configuration and ran refreshQueues, and then
> found that the ResourceManager hangs. The ResourceManager also cannot restart
> successfully. The jstack output looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
>     - locked <0x7f8c4a8177a0> (a java.util.HashMap)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
>     - locked <0x7f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c4a76ac48> (a java.lang.Object)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c49254268> (a java.lang.Object)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c467495e0> (a java.lang.Object)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> While debugging the cluster, we found that 
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop 
> never terminates. We found that, in our cluster, all minRes is over int.max. 






[jira] [Updated] (YARN-7218) ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2

2017-11-25 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7218:

Fix Version/s: 3.1.0

> ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
> 
>
> Key: YARN-7218
> URL: https://issues.apache.org/jira/browse/YARN-7218
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, applications
>Reporter: Eric Yang
>Assignee: Eric Yang
> Fix For: 3.1.0
>
> Attachments: YARN-7218.001.patch, YARN-7218.002.patch, 
> YARN-7218.003.patch, YARN-7218.004.patch
>
>
> In YARN-6626, there is a desire to run the ApiServer REST API inside the 
> Resource Manager; this can eliminate the requirement to deploy another daemon 
> service for submitting docker applications.  In YARN-5698, a new UI has been 
> implemented as a separate web application.  There are some problems in this 
> arrangement that can cause conflicts in how Java sessions are managed.  
> The root context of the Resource Manager web application is /ws.  This is hard 
> coded in the startWebapp method in ResourceManager.java.  This means all the 
> session management is applied to web URLs with the /ws prefix.  /ui2 is 
> independent of the /ws context, therefore session management code doesn't 
> apply to /ui2.  This could be a session management problem if servlet-based 
> code is going to be introduced into the /ui2 web application.
> The ApiServer code base is designed as a separate web application.  There is no 
> easy way to inject a separate web application into the same /ws context 
> because ResourceManager is already set up to bind to RMWebServices.  Unless 
> ApiServer code is moved into RMWebServices, they will not share 
> the same session management.
> The alternate solution is to keep the ApiServer prefix URL independent of the 
> /ws context.  However, this will be a departure from the YARN web services 
> naming convention.  ApiServer can be loaded as a separate web application in 
> the Resource Manager jetty server.  One possible proposal is /app/v1/services.  
> This can keep ApiServer code modular and independent from Resource Manager.
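
The context separation described above can be illustrated with a small, self-contained sketch. This is not Jetty or ResourceManager code: the class ContextSketch and its methods are hypothetical, and it assumes a request path is matched to the longest context root that prefixes it. The context roots /ws, /ui2 and /app/v1/services are taken from the description; the request paths under /ui2 and /app/v1/services are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ContextSketch {
    // Context roots named in the description; /app/v1/services is only a proposal there.
    static final List<String> CONTEXTS = Arrays.asList("/ws", "/ui2", "/app/v1/services");

    // Returns the longest context root that prefixes the path, or null if none matches.
    public static String resolveContext(String path) {
        String best = null;
        for (String ctx : CONTEXTS) {
            boolean matches = path.equals(ctx) || path.startsWith(ctx + "/");
            if (matches && (best == null || ctx.length() > best.length())) {
                best = ctx;
            }
        }
        return best;
    }

    // Session handling configured on one context covers only paths that resolve to it.
    public static boolean sharesSessionManagement(String pathA, String pathB) {
        String a = resolveContext(pathA);
        return a != null && a.equals(resolveContext(pathB));
    }

    public static void main(String[] args) {
        System.out.println(resolveContext("/ws/v1/cluster/apps"));
        System.out.println(resolveContext("/app/v1/services/example"));
        System.out.println(sharesSessionManagement("/ws/v1/cluster/apps", "/ui2/index.html"));
    }
}
```

Under this assumption, a path below /ui2 or /app/v1/services never resolves to /ws, which is why session handling attached to /ws would not cover either of them.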






[jira] [Created] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2017-11-25 Thread zhengchenyu (JIRA)
zhengchenyu created YARN-7560:
-

 Summary: Resourcemanager hangs when  
resourceUsedWithWeightToResourceRatio return a overflow value 
 Key: YARN-7560
 URL: https://issues.apache.org/jira/browse/YARN-7560
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1, 3.0.0
Reporter: zhengchenyu
 Fix For: 2.7.5


In our cluster, we changed the configuration and then ran refreshQueues, after 
which the ResourceManager hung. The ResourceManager also could not restart 
successfully. The jstack output looks like this:
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
    - locked <0x7f8c4a8177a0> (a java.util.HashMap)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
    - locked <0x7f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x7f8c4a76ac48> (a java.lang.Object)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x7f8c49254268> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x7f8c467495e0> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
{code}
While debugging the cluster, we found that 
resourceUsedWithWeightToResourceRatio returns a negative value, so the loop 
never terminates. We found that, in our cluster, all minRes is over int.max. 
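
To illustrate how a negative return value can keep such a loop from finishing, here is a minimal sketch. It is not the ComputeFairShares source: the class OverflowSketch, its method names, and the two minimum shares of Integer.MAX_VALUE are all assumptions chosen for illustration. It assumes the shares are summed in an int and that the caller keeps doubling a ratio until the sum reaches the cluster total. The loop is capped here so the example itself terminates.

```java
public class OverflowSketch {
    // Hypothetical stand-in for the summing step: each share is floored at its
    // minimum, and the running total is kept in an int, so it can wrap around.
    public static int usedWithIntSum(double ratio, int[] minShares) {
        int taken = 0;
        for (int min : minShares) {
            taken += (int) Math.max(min, ratio); // int addition wraps silently
        }
        return taken;
    }

    // Same computation with a long accumulator, which does not wrap for these inputs.
    public static long usedWithLongSum(double ratio, int[] minShares) {
        long taken = 0L;
        for (int min : minShares) {
            taken += (int) Math.max(min, ratio);
        }
        return taken;
    }

    // Doubling search with an iteration cap; returns -1 if the total is never reached.
    public static int doublingsUntilCovered(int total, int[] minShares,
                                            boolean useLong, int maxIterations) {
        double ratio = 1.0;
        for (int i = 0; i < maxIterations; i++) {
            long used = useLong ? usedWithLongSum(ratio, minShares)
                                : usedWithIntSum(ratio, minShares);
            if (used >= total) {
                return i;
            }
            ratio *= 2.0;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] minShares = {Integer.MAX_VALUE, Integer.MAX_VALUE}; // assumed values
        System.out.println(usedWithIntSum(1.0, minShares));
        System.out.println(usedWithLongSum(1.0, minShares));
        System.out.println(doublingsUntilCovered(1000, minShares, false, 64));
        System.out.println(doublingsUntilCovered(1000, minShares, true, 64));
    }
}
```

With these assumed inputs the int sum wraps to a negative number no matter how large the ratio grows, so an uncapped doubling loop would never reach the total. The long accumulator reaches it on the first pass.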






[jira] [Assigned] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher

2017-11-25 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-7229:
--

Assignee: sandflee

> Add a metric for the size of event queue in AsyncDispatcher
> ---
>
> Key: YARN-7229
> URL: https://issues.apache.org/jira/browse/YARN-7229
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: sandflee
>
> The size of the event queue in AsyncDispatcher is a good metric for monitoring 
> daemon performance. Let's make it an RM metric.






[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265837#comment-16265837
 ] 

Eric Payne commented on YARN-7496:
--

Thank you very much, [~leftnoteasy]

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.8.3
>
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.
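
The two formulas quoted above can be compared with a small numeric sketch. The class UserLimitSketch, its method names, and all the numbers are hypothetical, and "total active resources" is read here as the resources used by active users, which is an assumption. As in the description, ULF, MULP and other factors are ignored.

```java
public class UserLimitSketch {
    // 2.8 Capacity Scheduler side, per the description: total used / active users.
    public static long limitFromTotalUsed(long totalUsed, int activeUsers) {
        return totalUsed / activeUsers;
    }

    // Later releases and the backported preemption code: total active / active users.
    public static long limitFromTotalActive(long totalActive, int activeUsers) {
        return totalActive / activeUsers;
    }

    public static void main(String[] args) {
        long totalUsed = 100;   // assumed: all resources used in the queue
        long totalActive = 60;  // assumed: the part used by active users
        int activeUsers = 2;
        long userUsage = 40;    // assumed usage of one active user

        long schedulerLimit = limitFromTotalUsed(totalUsed, activeUsers);
        long preemptionLimit = limitFromTotalActive(totalActive, activeUsers);
        System.out.println("scheduler limit:  " + schedulerLimit);
        System.out.println("preemption limit: " + preemptionLimit);
        System.out.println("over preemption limit: " + (userUsage > preemptionLimit));
        System.out.println("under scheduler limit: " + (userUsage < schedulerLimit));
    }
}
```

In this made-up case the same user is above one limit and below the other. That is the kind of disagreement that could let preemption take containers away and the scheduler hand them back.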






[jira] [Commented] (YARN-7505) RM REST endpoints generate malformed JSON

2017-11-25 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265756#comment-16265756
 ] 

Daniel Templeton commented on YARN-7505:


Now that I'm into the implementation, I can see that this change would be 
better targeted at 3.1. If we're going to introduce a v2, we should at least 
allow some time for any pent-up incompatible changes to surface and be handled. 
 There are likely several places where resource types make more sense in the 
API with an incompatible change.  Let's hold off on this one until 3.1 and try 
to get the APIs cleaned up between now and then.

> RM REST endpoints generate malformed JSON
> -
>
> Key: YARN-7505
> URL: https://issues.apache.org/jira/browse/YARN-7505
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: restapi
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-7505.001.patch, YARN-7505.002.patch
>
>
> For all endpoints that return DAOs that contain maps, the generated JSON is 
> malformed.  For example:
> % curl 'http://localhost:8088/ws/v1/cluster/apps'
> {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}}
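
In the output above, the problem is visible in resourceSecondsMap, where the member name "entry" repeats once per map entry. The following sketch reproduces that shape and contrasts it with one possible well-formed alternative. It is not the YARN serialization code and not necessarily the fix chosen for this issue; the class MapJsonSketch and its methods are hypothetical, and only the four key/value pairs are taken from the output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapJsonSketch {
    // Minimal escaping, enough for the plain strings used in this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Shape seen in the output: the member name "entry" repeats for every map entry.
    public static String repeatedEntryForm(Map<String, String> map) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, String> e : map.entrySet()) {
            sb.append(sep).append("\"entry\":{\"key\":").append(quote(e.getKey()))
              .append(",\"value\":").append(quote(e.getValue())).append("}");
            sep = ",";
        }
        return sb.append("}").toString();
    }

    // One possible alternative: each map key becomes its own, unique member name.
    public static String objectForm(Map<String, String> map) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, String> e : map.entrySet()) {
            sb.append(sep).append(quote(e.getKey())).append(":").append(quote(e.getValue()));
            sep = ",";
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> seconds = new LinkedHashMap<>();
        seconds.put("test2", "0");
        seconds.put("test", "0");
        seconds.put("memory-mb", "49820");
        seconds.put("vcores", "26");
        System.out.println(repeatedEntryForm(seconds));
        System.out.println(objectForm(seconds));
    }
}
```

A client that loads the first form into a map keyed by member name can end up with a single "entry" and lose the others, whereas the second form keeps every key distinct.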






[jira] [Updated] (YARN-7505) RM REST endpoints generate malformed JSON

2017-11-25 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-7505:
---
Target Version/s: 3.1.0  (was: 3.0.0)

> RM REST endpoints generate malformed JSON
> -
>
> Key: YARN-7505
> URL: https://issues.apache.org/jira/browse/YARN-7505
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: restapi
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-7505.001.patch, YARN-7505.002.patch
>
>
> For all endpoints that return DAOs that contain maps, the generated JSON is 
> malformed.  For example:
> % curl 'http://localhost:8088/ws/v1/cluster/apps'
> {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}}






[jira] [Updated] (YARN-6647) RM can crash during transitionToStandby due to InterruptedException

2017-11-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-6647:
---
Target Version/s: 3.0.0

Marking target as 3.0.0

> RM can crash during transitionToStandby due to InterruptedException
> ---
>
> Key: YARN-6647
> URL: https://issues.apache.org/jira/browse/YARN-6647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4
>Reporter: Jason Lowe
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-6647.001.patch, YARN-6647.002.patch, 
> YARN-6647.003.patch, YARN-6647.004.patch, YARN-6647.005.patch
>
>
> Noticed some tests were failing due to the JVM shutting down early.  I was 
> able to reproduce this occasionally with TestKillApplicationWithRMHA.  
> Stacktrace to follow.






[jira] [Commented] (YARN-7541) Node updates don't update the maximum cluster capability for resources other than CPU and memory

2017-11-25 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265619#comment-16265619
 ] 

Yufei Gu commented on YARN-7541:


Jenkins failed https://builds.apache.org/job/PreCommit-YARN-Build/18660/console

> Node updates don't update the maximum cluster capability for resources other 
> than CPU and memory
> 
>
> Key: YARN-7541
> URL: https://issues.apache.org/jira/browse/YARN-7541
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.0.0-beta1, 3.1.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-7541.001.patch, YARN-7541.002.patch, 
> YARN-7541.003.patch, YARN-7541.004.patch, YARN-7541.005.patch
>
>
> When I submit an MR job that asks for too much memory or CPU for the map or 
> reduce, the AM will fail because it recognizes that the request is too large. 
>  With any other resources, however, the resource requests will instead be 
> made and remain pending forever.  Looks like we forgot to update the code 
> that tracks the maximum container allocation in {{ClusterNodeTracker}}.
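
The behaviour described above can be sketched as follows. This is not the ClusterNodeTracker implementation: the class MaxCapabilitySketch, its methods, the resource name "gpu" and all the numbers are hypothetical. It assumes the tracker keeps, for every resource name a node reports, the largest per-node value seen so far, so that an oversized request for any resource type can be rejected instead of staying pending.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxCapabilitySketch {
    // Largest per-node capacity seen so far, keyed by resource name.
    private final Map<String, Long> maxPerResource = new HashMap<>();

    // Called on node add or update; considers every resource type the node reports.
    // Lowering the maximum when a node leaves or shrinks is omitted from this sketch.
    public void onNodeUpdate(Map<String, Long> nodeCapacity) {
        for (Map.Entry<String, Long> e : nodeCapacity.entrySet()) {
            maxPerResource.merge(e.getKey(), e.getValue(), Long::max);
        }
    }

    // A request can be satisfied only if no single resource exceeds the known maximum.
    public boolean fitsWithinMaximum(Map<String, Long> request) {
        for (Map.Entry<String, Long> e : request.entrySet()) {
            long max = maxPerResource.getOrDefault(e.getKey(), 0L);
            if (e.getValue() > max) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MaxCapabilitySketch tracker = new MaxCapabilitySketch();
        Map<String, Long> node = new HashMap<>();
        node.put("memory-mb", 8192L);
        node.put("vcores", 8L);
        node.put("gpu", 2L); // hypothetical extra resource type
        tracker.onNodeUpdate(node);

        Map<String, Long> request = new HashMap<>();
        request.put("memory-mb", 1024L);
        request.put("gpu", 4L); // more than any node offers
        System.out.println(tracker.fitsWithinMaximum(request));
    }
}
```

If only memory and vcores were tracked, the gpu request in this example would look acceptable and could wait forever, which matches the symptom in the report.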


