[
https://issues.apache.org/jira/browse/YARN-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan resolved YARN-11834.
-------------------------------
Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Resolution: Fixed
> [Capacity Scheduler] Application Stuck In ACCEPTED State due to Race Condition
> ------------------------------------------------------------------------------
>
> Key: YARN-11834
> URL: https://issues.apache.org/jira/browse/YARN-11834
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.4.0, 3.4.1
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> It was noted that in a Hadoop 3.4.1 YARN deployment, a Spark application was
> stuck in the ACCEPTED state even though the cluster had enough resources.
>
> *Steps to replicate*
> 1. Launch a YARN cluster with total capacity of at least 1.59 TB memory and
> 660 vCores
> 2. Apply the following properties (a programmatic sketch of these settings
> follows after step 3):
> *{{capacity-scheduler}}*
> {{"yarn.scheduler.capacity.node-locality-delay": "-1",
> "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
> "yarn.scheduler.capacity.schedule-asynchronously.enable": "true"}}
>
> *{{yarn-site}}*
> {{"yarn.log-aggregation-enable": "true",
> "yarn.log-aggregation.retain-check-interval-seconds": "300",
> "yarn.log-aggregation.retain-seconds": "-1",
> "yarn.scheduler.capacity.max-parallel-apps": "1"}}
> 3. Submit multiple Spark jobs that launch a large number of containers. For
> example:
> {{spark-example --conf spark.dynamicAllocation.enabled=false --num-executors
> 2000 --driver-memory 1g --executor-memory 1g --executor-cores 1 SparkPi 1000}}
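>
> For a quick repro harness, the step-2 properties can also be applied
> programmatically. A minimal sketch, assuming only hadoop-common on the
> classpath (the ReproConfig class name is illustrative; the property keys and
> values are taken verbatim from step 2):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> // Builds a Configuration carrying the repro properties from step 2.
> public class ReproConfig {
>   public static Configuration build() {
>     Configuration conf = new Configuration(false);
>     // capacity-scheduler
>     conf.set("yarn.scheduler.capacity.node-locality-delay", "-1");
>     conf.set("yarn.scheduler.capacity.resource-calculator",
>         "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator");
>     conf.set("yarn.scheduler.capacity.schedule-asynchronously.enable", "true");
>     // yarn-site
>     conf.set("yarn.log-aggregation-enable", "true");
>     conf.set("yarn.log-aggregation.retain-check-interval-seconds", "300");
>     conf.set("yarn.log-aggregation.retain-seconds", "-1");
>     conf.set("yarn.scheduler.capacity.max-parallel-apps", "1");
>     return conf;
>   }
> }
> {code}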
>
> *Observations*
> On analysing the logs, the following observations were made:
> When Application 1 completes, there's a period where its resource requests
> are still being processed or "honored" by the scheduler. During this
> transition period, the following sequence could occur:
> 1. Application 1 completes and releases its resources
> 2. The scheduler is still processing some older allocation requests for
> Application 1
> 3. During this processing, the *cul.canAssign* flag for the user is set to
> false. Refer [Link #1|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1670] and
> [Link #2|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1268]
> 4. Application 2 (which is new) tries to get resources
> 5. The scheduler checks the user's cul.canAssign flag, finds it's false (due
> to [cache
> implementation|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1241]),
> and denies resources to Application 2
> 6. Application 2 remains in ACCEPTED state despite available resources
> This race condition occurs because the user's resource usage state (tracked
> in the CachedUserLimit object) isn't properly reset or synchronized
> between the completion of one application and the scheduling of another.
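>
> The pattern can be condensed into a small, self-contained Java sketch (this
> is not the actual scheduler code; CachedLimit and the cache map below are
> illustrative stand-ins for the per-user CachedUserLimit bookkeeping in
> AbstractLeafQueue):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class CachedFlagRace {
>   // Illustrative stand-in for the per-user cached limit entry.
>   static class CachedLimit {
>     volatile boolean canAssign = true;
>   }
>
>   // One cached entry per user, reused across allocation passes.
>   static final Map<String, CachedLimit> cache = new ConcurrentHashMap<>();
>
>   public static void main(String[] args) {
>     CachedLimit cul = cache.computeIfAbsent("user1", u -> new CachedLimit());
>
>     // Steps 2-3: a late allocation pass for the already-finished app 1
>     // decides the user is over its limit and caches canAssign = false.
>     cul.canAssign = false;
>
>     // Steps 4-5: app 2 from the same user asks for resources; the scheduler
>     // consults the stale cached flag instead of recomputing the user limit
>     // and denies the request even though the cluster has free resources.
>     if (!cache.get("user1").canAssign) {
>       System.out.println("app 2 skipped: stale cached canAssign=false");
>     }
>   }
> }
> {code}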
>
> *Solutions*
> I can think of two solutions for this race condition:
> # *Cache Invalidation*: Invalidate the cache when no user information is
> fetched ([here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1669]).
> By doing this, the new application (by the same user) will be forced to
> calculate new userLimits. The problem with this approach is the repeated
> calculation of userLimits.
> # *Skip setting the cul.canAssign flag*: In this approach, setting the
> cul.canAssign flag is skipped if the application has already completed or
> been removed from the applicationAttempt list - refer to
> [this|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1267]
> code pointer. A rough sketch of this guard follows below.
>
> I am personally inclined towards approach 2.
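>
> A rough sketch of the approach-2 guard (every name here is hypothetical and
> only stands in for the actual AbstractLeafQueue code behind the pointer
> above): skip caching a negative result once the attempt is gone.
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class SkipStaleFlag {
>   static class CachedLimit {
>     volatile boolean canAssign = true;
>   }
>
>   // Stand-in for the queue's registry of live application attempts.
>   static final Map<String, Object> liveAttempts = new ConcurrentHashMap<>();
>
>   static void updateCachedFlag(CachedLimit cul, String attemptId,
>                                boolean computedCanAssign) {
>     if (!computedCanAssign && !liveAttempts.containsKey(attemptId)) {
>       // The attempt has already completed / been removed: skip caching
>       // the negative result so the user's next application recomputes
>       // its limit instead of inheriting a stale canAssign = false.
>       return;
>     }
>     cul.canAssign = computedCanAssign;
>   }
>
>   public static void main(String[] args) {
>     CachedLimit cul = new CachedLimit();
>     // A late negative result arrives for an attempt that is no longer live;
>     // the guard ignores it, so the cached flag stays true.
>     updateCachedFlag(cul, "appattempt_1", false);
>     System.out.println("canAssign still " + cul.canAssign); // prints true
>   }
> }
> {code}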