[ 
https://issues.apache.org/jira/browse/YARN-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007767#comment-18007767
 ] 

ASF GitHub Bot commented on YARN-11834:
---------------------------------------

hadoop-yetus commented on PR #7806:
URL: https://github.com/apache/hadoop/pull/7806#issuecomment-3083429246

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  29m  6s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  trunk passed with JDK 
Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 38s |  |  trunk passed with JDK 
Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  trunk passed with JDK 
Private Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m  9s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 19s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 21s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK 
Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m  9s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 113m 11s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 26s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 194m 21s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.51 ServerAPI=1.51 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7806/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/7806 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux cdbe8247f6ca 5.15.0-142-generic #152-Ubuntu SMP Mon May 19 
10:54:31 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 955c012985b55eef92715234fad807d169af9ae3 |
   | Default Java | Private Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_452-8u452-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7806/2/testReport/ |
   | Max. process+thread count | 937 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7806/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> [Capacity Scheduler] Application Stuck In ACCEPTED State due to Race Condition
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11834
>                 URL: https://issues.apache.org/jira/browse/YARN-11834
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> It was noted that in a Hadoop 3.4.1 YARN deployment, Spark application was 
> stuck in ACCEPTED state even though the cluster had enough resources.
>  
> *Steps to replicate*
> 1. Launch YARN cluster total capacity ≥ 1.59 TB memory, 660 vCores or more 
> {{{}2.Apply the following properties{}}}{*}{{*}}
> *{{capacity-scheduler}}*
> *{{{}"yarn.scheduler.capacity.node-locality-delay": "-1", 
> "yarn.scheduler.capacity.resource-calculator": 
> "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"{}}}{{{},{}}}*
> {*}{{"}}{*}{*}{{{}}{}}}}{*}{{{}*{\{{}yarn.scheduler.capacity.schedule-asynchronously.enable*{}}}{*}{{"
>  : "true"}}{*}
>  
> *{{yarn-site}}*
> *{{"yarn.log-aggregation-enable": "true",}}*
> *{{{}"yarn.log-aggregation.retain-check-interval-seconds": "300", 
> "yarn.log-aggregation.retain-seconds": "-1", 
> "yarn.scheduler.capacity.max-parallel-apps": "1"{}}}{{{{}}{}}}*
> 3. Submit multiple Spark jobs that launch a large number of containers. For 
> example:
> {{spark-example --conf spark.dynamicAllocation.enabled=false --num-executors 
> 2000 --driver-memory 1g --executor-memory 1g --executor-cores 1 SparkPi 1000}}
>  
> *Observations*
> On analysis the logs, The following were the observations :
> When Application 1 completes, there's a period where its resource requests 
> are still being processed or "honored" by the scheduler. During this 
> transition period, the following sequence could occur:
> 1. Application 1 completes and releases its resources
> 2. The scheduler is still processing some older allocation requests for 
> Application 1
> 3. During this processing, the *cul.canAssign flag* for the user is set to 
> false. Refer [Link#1 
> |https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1670]and
>  [Link 
> #2|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1268]
> 4. Application 2 (which is new) tries to get resources
> 5. The scheduler checks the user's cul.canAssign flag, finds it's false (due 
> to [cache 
> implementation|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1241]),
>  and denies resources to Application 2
> 6. Application 2 remains in ACCEPTED state despite available resources
> This race condition occurs because the user's resource usage state (tracked 
> in the CapacityUsageLimit object) isn't properly reset or synchronized  
> between the completion of one application and the scheduling of another.
>  
> *Solutions*
> I can think of two solution for this race condition
>  # *Cache Invalidation* : Invalidate the cache when no user information is 
> fetched [here 
> |https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1669]by
>  doing this the new application (by the same user) will be forced to 
> calculate new userLimits. The problem with this problem is repeated 
> calculation of userLimits
>  # *Skip setting cul.canAssign flag :* In this approach setting of 
> cul.canAssign flag can be ignored if the application is already completed / 
> removed from the applicationAttempt list - Refer 
> [this|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1267]
>  code pointer
>  
> I am personally inclined to approach 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to