Syed Shameerur Rahman created YARN-11697:
--------------------------------------------

             Summary: Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
                 Key: YARN-11697
                 URL: https://issues.apache.org/jira/browse/YARN-11697
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.1
            Reporter: Syed Shameerur Rahman
            Assignee: Syed Shameerur Rahman


On Hadoop 3.2.1, the ResourceManager (RM) restarts frequently with the following exception:

{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
	at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
	at java.lang.Thread.run(Thread.java:750)
{code}

The exception looks similar to the one reported in YARN-5136, but there still seem to be some edge cases that YARN-5136 does not cover.

1. On a closer look, I could see that, as mentioned in the comment here, if calls to moveApplication and removeApplicationAttempt for the same attempt are processed in quick succession, the application attempt still holds a queue reference but has already been removed from that queue's list of applications.

2. This can happen when [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908] removes the appAttempt from the queue and [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707] also tries to remove the same appAttempt from the queue.

3. On further checking, I could see that [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779] takes the writeLock on the appAttempt before proceeding, whereas in [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665] I don't see any writeLock being taken, which can result in a race condition if the same appAttempt is being processed by both.

4. Additionally, as mentioned in the comment here, when such a scenario occurs we should ideally not take down the RM.
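To make the suspected interleaving concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the FairScheduler code and not a proposed patch: ToyQueue, ToyAttempt, move and remove are hypothetical stand-ins invented for illustration. It assumes that taking the attempt's write lock on both paths, and treating an already-removed attempt as a no-op instead of throwing, would be enough to avoid the failure.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MoveRemoveRaceSketch {

    /** Hypothetical stand-in for a leaf queue; NOT the real FSLeafQueue. */
    static final class ToyQueue {
        final String name;
        private final Set<String> apps = new HashSet<>();

        ToyQueue(String name) { this.name = name; }

        synchronized void addApp(String id) { apps.add(id); }
        synchronized boolean removeApp(String id) { return apps.remove(id); }
        synchronized boolean contains(String id) { return apps.contains(id); }
    }

    /** Hypothetical stand-in for an app attempt that holds a queue reference. */
    static final class ToyAttempt {
        final String id;
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private ToyQueue queue; // guarded by lock; null once the attempt is removed

        ToyAttempt(String id, ToyQueue queue) {
            this.id = id;
            this.queue = queue;
            queue.addApp(id);
        }
    }

    /** Moves the attempt to the target queue while holding the attempt's write lock. */
    static void move(ToyAttempt attempt, ToyQueue target) {
        attempt.lock.writeLock().lock();
        try {
            ToyQueue old = attempt.queue;
            if (old == null) {
                return; // attempt was already removed; nothing to move
            }
            old.removeApp(attempt.id);
            target.addApp(attempt.id);
            attempt.queue = target; // queue reference and membership change together
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    /** Removes the attempt under the same write lock; returns false instead of throwing. */
    static boolean remove(ToyAttempt attempt) {
        attempt.lock.writeLock().lock();
        try {
            ToyQueue current = attempt.queue;
            if (current == null) {
                return false; // already removed; treated as a no-op
            }
            boolean removed = current.removeApp(attempt.id);
            attempt.queue = null;
            if (!removed) {
                // Log and continue rather than failing the whole process.
                System.err.println("Attempt " + attempt.id + " not found in " + current.name);
            }
            return removed;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ToyQueue a = new ToyQueue("root.a");
        ToyQueue b = new ToyQueue("root.b");
        ToyAttempt attempt = new ToyAttempt("appattempt_0001", a);

        Thread mover = new Thread(() -> move(attempt, b));
        Thread remover = new Thread(() -> remove(attempt));
        mover.start();
        remover.start();
        mover.join();
        remover.join();

        System.out.println("in root.a: " + a.contains(attempt.id)
                + ", in root.b: " + b.contains(attempt.id));
    }
}
```

Because both paths serialize on the same lock, whichever thread runs first leaves the attempt in a consistent state for the other: the attempt ends up in neither queue, and no exception escapes. Whether the real fix should lock in removeApplicationAttempt, downgrade the exception to a warning, or both is left open here.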