[jira] [Resolved] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li resolved YARN-11091.
--------------------------
    Resolution: Duplicate

> NPE at FiCaSchedulerApp#findNodeToUnreserve
> -------------------------------------------
>
>                 Key: YARN-11091
>                 URL: https://issues.apache.org/jira/browse/YARN-11091
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.1
>            Reporter: Bo Li
>            Priority: Critical
>
> When NodeManager hadoop123 shut down, the scheduler appears to go into a loop, and it hits an NPE once hadoop123 comes back into service, since the FiCaSchedulerNode is no longer null at that point.
> {quote}
> 2022-03-15 23:35:25,488 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,490 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,492 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop123.com:8043
> 2022-03-15 23:35:25,495 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop123.com
> 2022-03-15 23:35:25,499 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop123.com Node Transitioned from NEW to RUNNING
> 2022-03-15 23:35:25,499 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node hadoop123.com(cmPort: 8043 httpPort: 8042) registered with capability: , assigned nodeId hadoop123.com:8043
> 2022-03-15 23:35:25,515 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-15,5,main] threw an Exception.
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainer(RegularContainerAllocator.java:587)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:400)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:480)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:258)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:845)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:883)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:927)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1174)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1566)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1560)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1660)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1409)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
> {quote}
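Since the ticket is resolved as a duplicate, the real fix is tracked elsewhere. For context only, here is a minimal sketch of the guard pattern the trace points at: it is not the Hadoop source, and NodeTrackerSketch, SchedulerNodeSketch, and reservedContainer are simplified stand-ins for the real scheduler types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for FiCaSchedulerNode: a node may or may not
// currently hold a reserved container.
class SchedulerNodeSketch {
  private Object reservedContainer; // placeholder for RMContainer
  Object getReservedContainer() { return reservedContainer; }
}

// Simplified stand-in for the scheduler's node map.
class NodeTrackerSketch {
  private final Map<String, SchedulerNodeSketch> nodes = new ConcurrentHashMap<>();

  SchedulerNodeSketch getNode(String nodeId) {
    return nodes.get(nodeId); // null once the NM has unregistered
  }

  SchedulerNodeSketch findNodeToUnreserve(String nodeId) {
    SchedulerNodeSketch node = getNode(nodeId);
    if (node == null) {
      // Matches the repeated ERROR lines in the log above: while the NM
      // is down, the lookup fails and the caller retries in a loop.
      System.err.println("node to unreserve doesn't exist, nodeid: " + nodeId);
      return null;
    }
    // Per the report, once the NM re-registers, getNode returns a fresh,
    // non-null node that carries no reservation; dereferencing the
    // reserved container without this guard is the kind of NPE shown in
    // the stack trace above.
    if (node.getReservedContainer() == null) {
      System.err.println("no reserved container on node, nodeid: " + nodeId);
      return null;
    }
    return node;
  }
}
```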
[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11091:
-------------------------
    Description: When NodeManager hadoop123 shut down, the scheduler appears to go into a loop, and it hits an NPE once hadoop123 comes back into service, since the FiCaSchedulerNode is no longer null at that point. (Log excerpt and stack trace as quoted above.)

    was: When NodeManager hadoop123 shut down, the scheduler appears to go into a loop, and it hits an NPE after NodeManager x restarts, since the FiCaSchedulerNode is no longer null. (Same log excerpt and stack trace.)
[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11091:
-------------------------
    Component/s: capacity scheduler
[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11091:
-------------------------
    Description: When NodeManager hadoop123 shut down, the scheduler appears to go into a loop, and it hits an NPE after NodeManager x restarts, since the FiCaSchedulerNode is no longer null. (Log excerpt and stack trace as quoted above.)

    was: When NodeManager x shut down, it appears to go into a loop and hits an NPE after NodeManager x restarts. (Same log excerpt and stack trace.)
[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11091:
-------------------------
    Affects Version/s: 3.1.1
[jira] [Updated] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
[ https://issues.apache.org/jira/browse/YARN-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11091:
-------------------------
    Description: When NodeManager x shut down, it appears to go into a loop and hits an NPE after NodeManager x restarts. (Log excerpt as quoted above, with hostnames anonymized to hadoop123.com except the assigned nodeId hadoop2375.rz.momo.com:8043; stack trace unchanged.)

    was: When NodeManager x shut down, it appears to go into a loop and hits an NPE after NodeManager x restarts. (Log excerpt with the original hadoop2375.rz.momo.com hostnames, as in the creation report below; stack trace unchanged.)
[jira] [Created] (YARN-11091) NPE at FiCaSchedulerApp#findNodeToUnreserve
Bo Li created YARN-11091:
-------------------------
             Summary: NPE at FiCaSchedulerApp#findNodeToUnreserve
                 Key: YARN-11091
                 URL: https://issues.apache.org/jira/browse/YARN-11091
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Bo Li

When NodeManager x shut down, it appears to go into a loop and hits an NPE after NodeManager x restarts.
{quote}
2022-03-15 23:35:25,488 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,490 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,492 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,495 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,499 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop2375.rz.momo.com:8043 Node Transitioned from NEW to RUNNING
2022-03-15 23:35:25,499 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node hadoop2375.rz.momo.com(cmPort: 8043 httpPort: 8042) registered with capability: , assigned nodeId hadoop2375.rz.momo.com:8043
2022-03-15 23:35:25,515 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-15,5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerApp.java:905)
(remaining frames identical to the stack trace quoted in full above)
{quote}
[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11082:
-------------------------
    Description: We use the cluster resource as the denominator to decide which resource is dominant in AbstractCSQueue#canAssignToThisQueue. However, nodes in our cluster are configured differently.
{quote}
2021-12-09 10:24:37,069 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1637412555366_1588993_01 container=null queue=root.a.a1.a2 clusterResource= type=RACK_LOCAL requestedPartition=x
2021-12-09 10:24:37,069 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: Used resource= exceeded maxResourceLimit of the queue =
2021-12-09 10:24:37,069 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
{quote}
Even though root.a.a1.a2 has used 687/687 vcores, the following check in AbstractCSQueue#canAssignToThisQueue still returns false:
{quote}
Resources.greaterThanOrEqual(resourceCalculator, clusterResource, usedExceptKillable, currentLimitResource)
{quote}
clusterResource =
usedExceptKillable =
currentLimitResource =

currentLimitResource:
    memory: 3381248/175117312 = 0.01930847362
    vCores: 687/40222 = 0.01708020486
usedExceptKillable:
    memory: 3384320/175117312 = 0.01932601615
    vCores: 688/40222 = 0.01710506687

DRF will treat memory as the dominant resource and return false in this scenario.

    was: (Same text; this update only changed requestedPartition=xx to requestedPartition=x in the log excerpt and adjusted whitespace around the quoted check.)

> Use node label resource as denominator to decide which resource is dominant
> ----------------------------------------------------------------------------
>
>                 Key: YARN-11082
>                 URL: https://issues.apache.org/jira/browse/YARN-11082
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.1
>            Reporter: Bo Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.1.1
>
>         Attachments: YARN-11082.001.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> (Description as above.)
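The arithmetic above is the whole story: against cluster totals, memory is the dominant share on both sides of the comparison, so the vcore exhaustion (688 of 687) never drives the decision. Below is a self-contained sketch of that comparison. It is not Hadoop's DominantResourceCalculator, and the node-label (partition) totals are an assumption, taken equal to the queue's current limit purely for illustration; the report does not give them.

```java
// Hedged sketch of the DRF comparison discussed above, using the numbers
// from the report. Not Hadoop code; label totals are assumed.
public class DominantShareDemo {

  /** Dominant share of a (memory, vcores) pair against a denominator. */
  static double dominantShare(long mem, long vcores,
                              long denomMem, long denomVcores) {
    return Math.max((double) mem / denomMem, (double) vcores / denomVcores);
  }

  public static void main(String[] args) {
    long usedMem = 3_384_320L, usedVcores = 688;  // usedExceptKillable
    long limMem  = 3_381_248L, limVcores  = 687;  // currentLimitResource

    // Denominator 1: whole-cluster resource, as canAssignToThisQueue uses.
    // Both dominant shares come from memory (~0.019326 vs ~0.019308), so
    // the comparison is decided by memory even though vcores already sit
    // at 688 of a 687-vcore limit.
    long clusterMem = 175_117_312L, clusterVcores = 40_222;
    System.out.printf("cluster denom: used=%.6f limit=%.6f%n",
        dominantShare(usedMem, usedVcores, clusterMem, clusterVcores),
        dominantShare(limMem, limVcores, clusterMem, clusterVcores));

    // Denominator 2: the node label's total resource (assumed here). Now
    // the dominant share of 'used' is vcores (688/687 > 1), so the vcore
    // over-use drives the comparison, which is the behavior the issue
    // title proposes.
    long labelMem = 3_381_248L, labelVcores = 687;
    System.out.printf("label denom:   used=%.6f limit=%.6f%n",
        dominantShare(usedMem, usedVcores, labelMem, labelVcores),
        dominantShare(limMem, limVcores, labelMem, labelVcores));
  }
}
```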
[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11082:
-------------------------
    Attachment: YARN-11082.001.patch
[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11082:
-------------------------
    Attachment: (was: YARN-11082.patch)
[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11082:
-------------------------
    Target Version/s: 3.1.1
[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Li updated YARN-11082:
-------------------------
    Description: (Text as quoted above, except the log excerpt still reads requestedPartition=xx; this update wrapped the embedded check in {quote} markers.)

    was: (Same text with the check in a fenced java block instead of {quote} markers.)
[jira] [Created] (YARN-11082) Use node label resource as denominator to decide which resource is dominant
Bo Li created YARN-11082:
-------------------------
             Summary: Use node label resource as denominator to decide which resource is dominant
                 Key: YARN-11082
                 URL: https://issues.apache.org/jira/browse/YARN-11082
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
    Affects Versions: 3.1.1
            Reporter: Bo Li

(Original description text as quoted above, with requestedPartition=xx in the log excerpt and the check in a fenced java block.)