[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan resolved YARN-11702. ------------------------------- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Fix Yarn over allocating containers > ----------------------------------- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn > Reporter: Syed Shameerur Rahman > Assignee: Syed Shameerur Rahman > Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: <memory:1408, max memory:2147483647, vCores:1, max > vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. > {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,327 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,328 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,362 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,364 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 > Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,454 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 > Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,578 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,579 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,829 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,829 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,864 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,864 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,080 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,081 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,115 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,115 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:17,931 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 3 on default port 8030): checking for deactivate of > application :application_1719234929152_0004 > 2024-06-24 14:10:20,743 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Application > application_1719234929152_0004 requests cleared {code} > *RCA* > It looks like the issue is happening due to asynchronous nature of AMRMClient > response communication. The newly allocated containers from Yarn is send as > part of the next request from the client causing additional additional. > Please check the following example. As per the above mentioned logs > # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, > since this is the first request, yarn had not allocated any containers and > send 0 containers as response. > # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn > which suggests that before this time there was an empty request ask to Yarn > which had returned 8 containers as allocated in reponse. For this request was > well the Yarn returned 8 container as allocated in response. > # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn > for which it returns 12 containers allocated (based on previous request) as > response. > # Since all the container request is exhausted, AM sends empty request to > yarn which returns 4 containers allocated (based on previous request) > # So in total 32 containers were allocated for 20 container request. The > problem here is AM (client) is being notified about the allocated container > request in the next heartbeat/request causing inconsistency. > > *Solutions* > *1. Modify the AM request in the Yarn/RM side* > Normalize/update the AM container request to consider the allocated > containers. This can be done in the respective scheduler side i.e > newContainerAsk = AMContainerAsk - AllocatedContainers. > *2. Make AM aware of allocated container* > Before client making allocate request, Check AM should check with RM about > container allocated. This approach requires additional AM-RM client > communication which can be expense when there are large number of allocate > requests. > > *PS: I could see similar behavior with Apache Tez and could happen with the > latest hadoop-3.4.0 version as well.* > > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org