[jira] [Resolved] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-20 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li resolved YARN-11091.
--
Resolution: Duplicate

> NPE at FiCaSchedulerApp#findNodeToUnreserve
> ---
>
> Key: YARN-11091
> URL: https://issues.apache.org/jira/browse/YARN-11091
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Critical
>
> When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
> and it hits an NPE after hadoop123 comes back online since the FiCaSchedulerNode
> is not null anymore.
>  {quote}
> 2022-03-15 23:35:25,488 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,490 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,492 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,495 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> hadoop123.com Node Transitioned from NEW to RUNNING
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
> with capability: , assigned nodeId 
> hadoop123.com:8043
> 2022-03-15 23:35:25,515 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-15,5,main] threw an Exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
> 
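For context, the failure described above reduces to a stale reservation keyed by NodeId: while the node is gone the lookup returns null and the allocator just logs and retries (the repeated "node to unreserve doesn't exist" lines), and once the NodeManager re-registers the lookup returns a brand-new node object that no longer carries the old reservation, so dereferencing its missing reservation fails with a NullPointerException. The sketch below is a minimal, self-contained illustration of that pattern only; it is not the actual Hadoop source, and every class, field, and method name in it is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified model of a scheduler's node map and one reservation. */
public class StaleReservationSketch {

  static class Reservation {
    final String containerId;
    Reservation(String containerId) { this.containerId = containerId; }
  }

  static class Node {
    final String nodeId;
    Reservation reservedContainer;   // null on a freshly registered node
    Node(String nodeId) { this.nodeId = nodeId; }
  }

  // Scheduler-side view of live nodes, keyed by NodeId.
  static final Map<String, Node> liveNodes = new ConcurrentHashMap<>();

  /** Roughly the shape of a findNodeToUnreserve-style lookup (hypothetical). */
  static Reservation findNodeToUnreserve(String nodeIdToUnreserve) {
    Node node = liveNodes.get(nodeIdToUnreserve);
    if (node == null) {
      // The "node to unreserve doesn't exist" branch: the allocator logs and
      // retries, which is the tight loop visible in the log above.
      System.err.println("node to unreserve doesn't exist, nodeid: " + nodeIdToUnreserve);
      return null;
    }
    // After the NM re-registers, `node` is a new object whose reservation is
    // null, so this dereference is where an NPE of the reported kind surfaces.
    System.out.println("unreserving " + node.reservedContainer.containerId
        + " on " + node.nodeId);
    return node.reservedContainer;
  }

  public static void main(String[] args) {
    String nodeId = "hadoop123.com:8043";
    Node original = new Node(nodeId);
    original.reservedContainer = new Reservation("container_001");
    liveNodes.put(nodeId, original);          // node registered, one container reserved on it

    liveNodes.remove(nodeId);                 // NM shuts down: scheduler drops the node
    findNodeToUnreserve(nodeId);              // logs "doesn't exist", returns null, allocator retries

    liveNodes.put(nodeId, new Node(nodeId));  // NM re-registers: fresh node, no reservation
    try {
      findNodeToUnreserve(nodeId);            // NullPointerException, as in the stack trace above
    } catch (NullPointerException e) {
      System.err.println("NPE analogous to the report: " + e);
    }
  }
}
```

In this simplified model, either cleaning up the reservation when the node is removed or null-checking the reservation on the re-registered node avoids the crash; which approach the upstream fix takes is not stated in this thread (the issue was resolved as a duplicate).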

[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11091:
-
Description: 
When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
and it hits an NPE after hadoop123 comes back online since the FiCaSchedulerNode
is not null anymore.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,490 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,492 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,495 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop123.com 
Node Transitioned from NEW to RUNNING
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
with capability: , assigned nodeId 
hadoop123.com:8043
2022-03-15 23:35:25,515 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-15,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1409)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)

{quote}

  was:
When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
and it hits an NPE after NodeManager x restarts since the FiCaSchedulerNode is not null anymore.
 {quote}
2022-03-15 23:35:25,488 ERROR 

[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11091:
-
Component/s: capacity scheduler

> NPE at FiCaSchedulerApp#findNodeToUnreserve
> ---
>
> Key: YARN-11091
> URL: https://issues.apache.org/jira/browse/YARN-11091
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Critical
>
> When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
> and it hits an NPE after NodeManager x restarts since the FiCaSchedulerNode is
> not null anymore.
>  {quote}
> 2022-03-15 23:35:25,488 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,490 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,492 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,495 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> hadoop123.com Node Transitioned from NEW to RUNNING
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
> with capability: , assigned nodeId 
> hadoop123.com:8043
> 2022-03-15 23:35:25,515 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-15,5,main] threw an Exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
> at 
> 

[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11091:
-
Description: 
When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
and it hits an NPE after NodeManager x restarts since the FiCaSchedulerNode is not null anymore.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,490 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,492 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,495 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop123.com 
Node Transitioned from NEW to RUNNING
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
with capability: , assigned nodeId 
hadoop123.com:8043
2022-03-15 23:35:25,515 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-15,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1409)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)

{quote}

  was:
When NodeManager x shut down, it looks like the scheduler goes into a loop and
hits an NPE after NodeManager x restarts.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to 

[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11091:
-
Affects Version/s: 3.1.1

> NPE at FiCaSchedulerApp#findNodeToUnreserve
> ---
>
> Key: YARN-11091
> URL: https://issues.apache.org/jira/browse/YARN-11091
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Critical
>
> When NodeManager hadoop123 shut down, the scheduler appears to go into a loop,
> and it hits an NPE after NodeManager x restarts since the FiCaSchedulerNode is
> not null anymore.
>  {quote}
> 2022-03-15 23:35:25,488 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,490 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,492 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,495 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: hadoop123.com
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> hadoop123.com Node Transitioned from NEW to RUNNING
> 2022-03-15 23:35:25,499 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
> with capability: , assigned nodeId 
> hadoop123.com:8043
> 2022-03-15 23:35:25,515 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-15,5,main] threw an Exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
> at 
> 

[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11091:
-
Description: 
When NodeManager x shut down, it looks like the scheduler goes into a loop and
hits an NPE after NodeManager x restarts.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,490 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,492 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com:8043
2022-03-15 23:35:25,495 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop123.com
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop123.com 
Node Transitioned from NEW to RUNNING
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered 
with capability: , assigned nodeId 
hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,515 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-15,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1409)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)

{quote}

  was:
When NodeManager x shut down, it looks like the scheduler goes into a loop and
hits an NPE after NodeManager x restarts.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: 

[jira] [Created] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve

2022-03-16 Thread Bo Li (Jira)
Bo Li created YARN-11091:


 Summary: NPE at FiCaSchedulerApp#findNodeToUnreserve
 Key: YARN-11091
 URL: https://issues.apache.org/jira/browse/YARN-11091
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bo Li


When NodeManager x shut down, it looks like the scheduler goes into a loop and
hits an NPE after NodeManager x restarts.
 {quote}
2022-03-15 23:35:25,488 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,490 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,492 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,495 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
hadoop2375.rz.momo.com:8043 Node Transitioned from NEW to RUNNING
2022-03-15 23:35:25,499 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node hadoop2375.rz.momo.com(cmPort: 8043 httpPort: 8042) 
registered with capability: , assigned 
nodeId hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,515 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-15,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1409)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)

{quote}





[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-03-01 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11082:
-
Description: 
We use the cluster resource as the denominator to decide which resource is
dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster
are configured differently.
{quote}2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 assignedContainer application attempt=appattempt_1637412555366_1588993_01 
container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=x
2021-12-09 10:24:37,069 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource= exceeded maxResourceLimit of the 
queue =

2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
{quote}
We can see that even though root.a.a1.a2 has used 687/687 vCores, the following
check in AbstractCSQueue#canAssignToThisQueue still returns false:
{quote}Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
usedExceptKillable, currentLimitResource)
{quote}
clusterResource = 
usedExceptKillable =  
currentLimitResource = 

currentLimitResource:
memory : 3381248/175117312 = 0.01930847362
vCores : 687/40222 = 0.01708020486

usedExceptKillable:
memory : 3384320/175117312 = 0.01932601615
vCores : 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.
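To make the numbers above concrete, here is a minimal, self-contained sketch of the dominant-share comparison. It is not the Hadoop Resources/DominantResourceCalculator API; the used/limit values and the cluster totals are the ones quoted in this description, while the node-label (partition) totals are made-up illustrative assumptions, since the real partition size is not given here. The only point is that the denominator decides which resource ends up with the larger share and is therefore treated as dominant.

```java
/** Sketch of a DRF-style dominant-share comparison (not the Hadoop API). */
public class DominantShareSketch {

  /** Share of each resource against the chosen denominator; the larger one is dominant. */
  static double dominantShare(long memory, long vcores, long denomMemory, long denomVcores) {
    double memShare = (double) memory / denomMemory;
    double cpuShare = (double) vcores / denomVcores;
    return Math.max(memShare, cpuShare);
  }

  static void compare(String label, long denomMemory, long denomVcores) {
    long usedMem = 3384320L, usedVcores = 688L;    // usedExceptKillable (from the report)
    long limitMem = 3381248L, limitVcores = 687L;  // currentLimitResource (from the report)

    double usedMemShare = (double) usedMem / denomMemory;
    double usedCpuShare = (double) usedVcores / denomVcores;
    String dominant = usedMemShare >= usedCpuShare ? "memory" : "vCores";

    double usedDom = dominantShare(usedMem, usedVcores, denomMemory, denomVcores);
    double limitDom = dominantShare(limitMem, limitVcores, denomMemory, denomVcores);

    System.out.printf("%s: memShare=%.8f cpuShare=%.8f -> dominant=%s, used >= limit: %b%n",
        label, usedMemShare, usedCpuShare, dominant, usedDom >= limitDom);
  }

  public static void main(String[] args) {
    // Whole-cluster resource as the denominator (the current behaviour described above):
    // 3384320/175117312 = 0.0193... vs 688/40222 = 0.0171..., so memory is dominant
    // even though the queue has exhausted its vCores (687/687).
    compare("cluster denominator", 175117312L, 40222L);

    // Hypothetical node-label (partition) resource as the denominator, as this issue
    // proposes: with a vCore-poor partition the vCore share is the larger one, so
    // vCores become the dominant resource instead. These totals are illustrative only.
    compare("label denominator  ", 8388608L, 1024L);
  }
}
```

With the whole cluster as the denominator, the comparison is effectively a memory comparison even though root.a.a1.a2 has used all of its vCores, which is the behaviour this issue proposes to change by using the node-label resource as the denominator instead.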

  was:
We use the cluster resource as the denominator to decide which resource is
dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster
are configured differently.
{quote}2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 assignedContainer application attempt=appattempt_1637412555366_1588993_01 
container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=xx
2021-12-09 10:24:37,069 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource= exceeded maxResourceLimit of the 
queue =

2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
{quote}
We can see that even though root.a.a1.a2 has used 687/687 vCores, the following
check in AbstractCSQueue#canAssignToThisQueue still returns false:

{quote}
Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
usedExceptKillable, currentLimitResource)
{quote}

clusterResource = 
usedExceptKillable =  
currentLimitResource = 

currentLimitResource:
memory : 3381248/175117312 = 0.01930847362
vCores : 687/40222 = 0.01708020486

usedExceptKillable:
memory : 3384320/175117312 = 0.01932601615
vCores : 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.


> Use node label resource as denominator to decide which resource is dominated
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.1.1
>
> Attachments: YARN-11082.001.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We use the cluster resource as the denominator to decide which resource is
> dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=x
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 has used 687/687 vCores, the
> following check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 

[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-02-24 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11082:
-
Attachment: YARN-11082.001.patch

> Use node label resource as denominator to decide which resource is dominated
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: YARN-11082.001.patch
>
>
> We use the cluster resource as the denominator to decide which resource is
> dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=xx
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 has used 687/687 vCores, the
> following check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}
> Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 0.01708020486
> usedExceptKillable:
> memory : 3384320/175117312 = 0.01932601615
> vCores : 688/40222 = 0.01710506687
> DRF will treat memory as the dominant resource and return false in this scenario.






[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-02-24 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11082:
-
Attachment: (was: YARN-11082.patch)

> Use node label resource as denominator to decide which resource is dominated
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: YARN-11082.001.patch
>
>
> We use the cluster resource as the denominator to decide which resource is
> dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=xx
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 has used 687/687 vCores, the
> following check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}
> Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 0.01708020486
> usedExceptKillable:
> memory : 3384320/175117312 = 0.01932601615
> vCores : 688/40222 = 0.01710506687
> DRF will treat memory as the dominant resource and return false in this scenario.






[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-02-24 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11082:
-
Target Version/s: 3.1.1

> Use node label resource as denominator to decide which resource is dominated
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: YARN-11082.patch
>
>
> We use the cluster resource as the denominator to decide which resource is
> dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=xx
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 has used 687/687 vCores, the
> following check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}
> Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 0.01708020486
> usedExceptKillable:
> memory : 3384320/175117312 = 0.01932601615
> vCores : 688/40222 = 0.01710506687
> DRF will treat memory as the dominant resource and return false in this scenario.






[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-02-24 Thread Bo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Li updated YARN-11082:
-
Description: 
We use the cluster resource as the denominator to decide which resource is
dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster
are configured differently.
{quote}2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 assignedContainer application attempt=appattempt_1637412555366_1588993_01 
container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=xx
2021-12-09 10:24:37,069 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource= exceeded maxResourceLimit of the 
queue =

2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
{quote}
We can see that even though root.a.a1.a2 has used 687/687 vCores, the following
check in AbstractCSQueue#canAssignToThisQueue still returns false:

{quote}
Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
usedExceptKillable, currentLimitResource)
{quote}

clusterResource = 
usedExceptKillable =  
currentLimitResource = 

currentLimitResource:
memory : 3381248/175117312 = 0.01930847362
vCores : 687/40222 = 0.01708020486

usedExceptKillable:
memory : 3384320/175117312 = 0.01932601615
vCores : 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.

  was:
We use the cluster resource as the denominator to decide which resource is
dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster
are configured differently.
{quote}
2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 assignedContainer application attempt=appattempt_1637412555366_1588993_01 
container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=xx
2021-12-09 10:24:37,069 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource= exceeded maxResourceLimit of the 
queue =

2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal

{quote}

We can see that even though root.a.a1.a2 has used 687/687 vCores, the following
check in AbstractCSQueue#canAssignToThisQueue still returns false:
```java
Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
  usedExceptKillable, currentLimitResource)
```
 clusterResource = 
usedExceptKillable =  
currentLimitResource = 

currentLimitResource:
memory : 3381248/175117312 = 0.01930847362
vCores : 687/40222 = 0.01708020486

usedExceptKillable:
memory : 3384320/175117312 = 0.01932601615
vCores : 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.


> Use node label resource as denominator to decide which resource is dominated
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
>
> We use the cluster resource as the denominator to decide which resource is
> dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=xx
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 has used 687/687 vCores, the
> following check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}
> Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 0.01708020486
> usedExceptKillable:
> memory : 3384320/175117312 = 0.01932601615
> vCores : 688/40222 = 0.01710506687
> DRF will treat memory as the dominant resource and 

[jira] [Created] (YARN-11082) Use node label resource as denominator to decide which resource is dominated

2022-02-24 Thread Bo Li (Jira)
Bo Li created YARN-11082:


 Summary: Use node label resource as denominator to decide which
resource is dominated
 Key: YARN-11082
 URL: https://issues.apache.org/jira/browse/YARN-11082
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 3.1.1
Reporter: Bo Li


We use the cluster resource as the denominator to decide which resource is
dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster
are configured differently.
{quote}
2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 assignedContainer application attempt=appattempt_1637412555366_1588993_01 
container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=xx
2021-12-09 10:24:37,069 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource= exceeded maxResourceLimit of the 
queue =

2021-12-09 10:24:37,069 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal

{quote}

We can see that even though root.a.a1.a2 has used 687/687 vCores, the following
check in AbstractCSQueue#canAssignToThisQueue still returns false:
```java
Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
  usedExceptKillable, currentLimitResource)
```
 clusterResource = 
usedExceptKillable =  
currentLimitResource = 

currentLimitResource:
memory : 3381248/175117312 = 0.01930847362
vCores : 687/40222 = 0.01708020486

usedExceptKillable:
memory : 3384320/175117312 = 0.01932601615
vCores : 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.


