Syed Shameerur Rahman created YARN-11697:
--------------------------------------------

             Summary: Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
                 Key: YARN-11697
                 URL: https://issues.apache.org/jira/browse/YARN-11697
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.1
            Reporter: Syed Shameerur Rahman
            Assignee: Syed Shameerur Rahman


For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the 
following exception:
{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:750)
{code}
The exception seems similar to the one mentioned in YARN-5136, but it looks 
like there are still some edge cases not covered by YARN-5136.

1. On a deeper look, as mentioned in the comment here, if calls to 
moveApplication and removeApplicationAttempt for the same attempt are 
processed in quick succession, the application attempt will still hold a 
queue reference but will already have been removed from the queue's list of 
applications.

2. This can happen when 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
 removes the appAttempt from the queue and 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
 also tries to remove the same appAttempt from the queue.

3. On further checking, I could see that before doing 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
 a writeLock on the appAttempt is taken, whereas for 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
 I don't see any write lock being taken, which can result in a race condition 
if the same appAttempt is being processed by both calls.

4. Additionally, as mentioned in the comment here, when such a scenario occurs 
we should ideally not take down the RM.
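
The points above can be illustrated with a minimal, self-contained sketch. It is not the FairScheduler code and not the proposed patch: the class and method names (MoveRemoveSketch, Queue, Attempt, move, remove) are hypothetical stand-ins. It assumes one possible approach, in which both operations take the attempt's write lock and a missing attempt is logged rather than raised as an IllegalStateException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model; these are NOT the real FairScheduler classes.
class MoveRemoveSketch {

  /** Stand-in for a leaf queue that tracks its application attempts. */
  static final class Queue {
    final String name;
    private final Set<Attempt> apps = new HashSet<>();

    Queue(String name) { this.name = name; }

    synchronized void addApp(Attempt a) { apps.add(a); }
    // Returns false instead of throwing when the attempt is not present.
    synchronized boolean removeApp(Attempt a) { return apps.remove(a); }
    synchronized boolean contains(Attempt a) { return apps.contains(a); }
  }

  /** Stand-in for an application attempt with a queue reference and a lock. */
  static final class Attempt {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Queue queue; // guarded by lock; null once the attempt is removed

    Attempt(Queue q) {
      queue = q;
      q.addApp(this);
    }
  }

  /** Moves the attempt to the target queue under the attempt's write lock. */
  static void move(Attempt a, Queue target) {
    a.lock.writeLock().lock();
    try {
      Queue old = a.queue;
      if (old == null) {
        return; // attempt was already removed; nothing to move
      }
      if (old.removeApp(a)) {
        target.addApp(a);
        a.queue = target; // queue reference and queue membership change together
      }
    } finally {
      a.lock.writeLock().unlock();
    }
  }

  /** Removes the attempt under the same write lock; never throws if it is missing. */
  static boolean remove(Attempt a) {
    a.lock.writeLock().lock();
    try {
      Queue q = a.queue;
      if (q == null || !q.removeApp(a)) {
        // Assumption: log and continue instead of failing the event dispatcher.
        System.err.println("Attempt not found in its queue; skipping removal");
        return false;
      }
      a.queue = null;
      return true;
    } finally {
      a.lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Queue src = new Queue("root.a");
    Queue dst = new Queue("root.b");
    Attempt attempt = new Attempt(src);

    // Issue both operations concurrently for the same attempt.
    Thread mover = new Thread(() -> move(attempt, dst));
    Thread remover = new Thread(() -> remove(attempt));
    mover.start();
    remover.start();
    mover.join();
    remover.join();

    System.out.println("in src=" + src.contains(attempt)
        + ", in dst=" + dst.contains(attempt));
  }
}
```

Because both paths serialize on the same lock, whichever call runs second sees a consistent queue reference: if the move runs first, the removal takes the attempt out of the target queue; if the removal runs first, the move becomes a no-op. In either order the attempt ends up in neither queue and no exception is thrown.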


