[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848046#comment-17848046
 ] 

Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds]  any thoughts on this ?

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not 
> exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, 
> running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to