[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848046#comment-17848046 ]
Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds] any thoughts on this?

> Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with
> the following exception:
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks
> like there are still some edge cases not covered by YARN-5136.
>
> 1. On a closer look, I could see that, as mentioned in the comment here, if
> calls to moveApplication and removeApplicationAttempt for the same attempt
> are processed in short succession, the application attempt will still
> contain a queue reference but is already removed from the list of
> applications for the queue.
> 2. This can happen when
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
> removes the appAttempt from the queue and
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
> also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before doing
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
> a write lock on the appAttempt is taken, whereas for
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
> I don't see any write lock being taken, which can result in a race
> condition if the same appAttempt is being processed by both.
> 4. Additionally, as mentioned in the comment here, when such a scenario
> occurs, ideally we should not take down the RM.
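
To make points 3 and 4 of the description concrete, here is a minimal sketch of the idea: both the move and the remove path take the same write lock on the attempt, and a removal that finds nothing to remove returns false instead of throwing. This is not the YARN-11697 patch and does not use the real FairScheduler, FSLeafQueue, or FSAppAttempt classes; SketchQueue, SketchAttempt, moveTo, and removeFromQueue are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a leaf queue; NOT the real FSLeafQueue.
class SketchQueue {
    private final String name;
    private final List<SketchAttempt> apps = new ArrayList<>();

    SketchQueue(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    synchronized void addApp(SketchAttempt attempt) {
        apps.add(attempt);
    }

    // Returns false (rather than throwing) when the attempt is not in this queue.
    synchronized boolean removeApp(SketchAttempt attempt) {
        return apps.remove(attempt);
    }

    synchronized boolean contains(SketchAttempt attempt) {
        return apps.contains(attempt);
    }
}

// Hypothetical stand-in for an application attempt; NOT the real FSAppAttempt.
class SketchAttempt {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private SketchQueue queue; // guarded by lock; null once the attempt is removed

    SketchAttempt(SketchQueue initialQueue) {
        this.queue = initialQueue;
        initialQueue.addApp(this);
    }

    // Move path: queue membership and the queue reference change together
    // under the write lock.
    boolean moveTo(SketchQueue target) {
        lock.writeLock().lock();
        try {
            if (queue == null) {
                return false; // attempt was already removed; nothing to move
            }
            queue.removeApp(this);
            target.addApp(this);
            queue = target;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Remove path: takes the SAME write lock, so it always sees a consistent
    // queue reference, and reports a missing attempt via the return value.
    boolean removeFromQueue() {
        lock.writeLock().lock();
        try {
            if (queue == null) {
                return false; // already removed; treat as a no-op
            }
            boolean removed = queue.removeApp(this);
            queue = null;
            return removed;
        } finally {
            lock.writeLock().unlock();
        }
    }

    SketchQueue getQueue() {
        lock.readLock().lock();
        try {
            return queue;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, a remove that arrives right after a move removes the attempt from the target queue, and a duplicate remove is a no-op the caller can log. Whether the real fix should lock in removeApplicationAttempt, tolerate the missing attempt in removeApp, or both, is for the patch discussion to settle.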