[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848123#comment-17848123 ]

Wilfred Spiegelenburg commented on YARN-11697:
----------------------------------------------

You need to figure out why you get two remove events in a row for the same application. This code has not changed in multiple years. If this were really a big issue, we would have seen it happen more often, and years ago. Try to reproduce without the backports and see if it still happens. You might have backported incompatible changes that cause side effects.

> Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the following exception:
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_01 Alloc: does not exist in queue [root, demand=, running=, share=, w=1.0]
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if calls to moveApplication and removeApplicationAttempt for the same attempt are processed in short succession, the application attempt will still contain a queue reference but is already removed from the list of applications for the queue.
> 2. This can happen when [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908] removes the appAttempt from the queue and [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707] also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before doing [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779] a writeLock on the appAttempt is taken, whereas for [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665] I don't see any write lock being taken, which can result in a race condition if the same appAttempt is being processed.
> 4. Additionally, as mentioned in the comment here, when such a scenario occurs, ideally we should not take down the RM.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848119#comment-17848119 ]

Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds]
# IMHO, when the appAttempt to be removed is not available in the queue, it should be handled more gracefully than by throwing an IllegalStateException, which takes down the RM.
# Since the appAttempt is not in the queue anyway, we can safely log a warning message instead of throwing an exception.

Any thoughts on the above approach?

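The proposal in the comment above (log a warning instead of throwing when the attempt is no longer in the queue) can be illustrated with a minimal sketch. This is not the Hadoop code: `SketchLeafQueue`, `removeAppStrict`, `removeAppLenient` and the attempt id strings are hypothetical names chosen for illustration, and the real `FSLeafQueue.removeApp` tracks considerably more state. Note that such a change only softens the symptom; it does not explain why a second remove event arrives for the same attempt.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.logging.Logger;

/** Hypothetical stand-in for a leaf queue; NOT the real FSLeafQueue. */
class SketchLeafQueue {
  private static final Logger LOG =
      Logger.getLogger(SketchLeafQueue.class.getName());

  private final String name;
  private final Set<String> attemptIds = new LinkedHashSet<>();

  SketchLeafQueue(String name) {
    this.name = name;
  }

  void addApp(String attemptId) {
    attemptIds.add(attemptId);
  }

  /** Strict variant: a missing attempt is treated as a fatal inconsistency. */
  void removeAppStrict(String attemptId) {
    if (!attemptIds.remove(attemptId)) {
      throw new IllegalStateException("Given app to remove " + attemptId
          + " does not exist in queue " + name);
    }
  }

  /** Lenient variant: a missing attempt is logged and reported to the caller. */
  boolean removeAppLenient(String attemptId) {
    boolean removed = attemptIds.remove(attemptId);
    if (!removed) {
      LOG.warning("App attempt " + attemptId + " not found in queue " + name
          + "; ignoring the remove request");
    }
    return removed;
  }
}

class GracefulRemoveDemo {
  public static void main(String[] args) {
    SketchLeafQueue queue = new SketchLeafQueue("root");
    queue.addApp("attempt_1");
    System.out.println(queue.removeAppLenient("attempt_1")); // first remove
    System.out.println(queue.removeAppLenient("attempt_1")); // duplicate remove
  }
}
```

Returning a boolean lets the caller decide whether a missing attempt deserves further investigation, while the event-handling thread keeps running.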
[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848086#comment-17848086 ]

Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

Additionally, I could see this specifically when an application is being killed.

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848085#comment-17848085 ]

Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds] I had some custom code/backports from a higher version, so the line numbers might differ from the OSS Hadoop code base. I could see the following exception, though:

java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_01 Alloc: does not exist in queue [root, demand=, running=, share=, w=1.0]

So this exception occurs only when the appAttempt has already been removed from the queue and we try to remove it again. Throwing IllegalStateException causes the RM to shut down. Can you think of any scenario in which this can happen?

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848072#comment-17848072 ]

Wilfred Spiegelenburg commented on YARN-11697:
----------------------------------------------

The stack trace does not correspond to Hadoop 3.2.1:
[FairScheduler.java:757|https://github.com/apache/hadoop/blob/branch-3.2.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L757]

That points to this line in Hadoop 3.2.1, which is part of completedContainerInternal:
{code:java}
755    application.containerCompleted(rmContainer, containerStatus, event);
756    if (node != null) {
757      node.releaseContainer(rmContainer.getContainerId(), false);
758    } else if (LOG.isDebugEnabled()) {
759      LOG.debug("Skipping container release on removed node: " + nodeID);
760    }
{code}
The comment in moveApplication about locking the app attempt concerns scheduling: an application could be scheduled while it is being moved, and that needs to be prevented. Removing an application attempt takes a write lock on the scheduler itself, as does the move. So a moveApplication and a removeApplicationAttempt cannot happen at the same time; they both need that lock and are serialised.

I think you are looking at the wrong thing and a move is not involved.

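The serialisation argument in the comment above can be sketched in a few lines: if both operations take the same scheduler-level write lock, their critical sections cannot interleave, whichever thread wins the lock first. The sketch below is an assumption-laden miniature, not the Hadoop implementation: `SketchScheduler`, its fields, the `trace` list and the queue names `root.a` / `root.b` are all hypothetical and exist only to make the ordering observable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical miniature scheduler; NOT the real FairScheduler. */
class SketchScheduler {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> queueOfAttempt = new HashMap<>();
  private final List<String> trace = new ArrayList<>();

  void addAttempt(String attemptId, String queue) {
    lock.writeLock().lock();
    try {
      queueOfAttempt.put(attemptId, queue);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void moveApplication(String attemptId, String targetQueue) {
    lock.writeLock().lock(); // same scheduler-level lock as the remove below
    try {
      trace.add("move:start");
      if (queueOfAttempt.containsKey(attemptId)) {
        queueOfAttempt.put(attemptId, targetQueue);
      }
      trace.add("move:end");
    } finally {
      lock.writeLock().unlock();
    }
  }

  boolean removeApplicationAttempt(String attemptId) {
    lock.writeLock().lock();
    try {
      trace.add("remove:start");
      boolean removed = queueOfAttempt.remove(attemptId) != null;
      trace.add("remove:end");
      return removed;
    } finally {
      lock.writeLock().unlock();
    }
  }

  List<String> trace() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(trace);
    } finally {
      lock.readLock().unlock();
    }
  }
}

class SerialisedOpsDemo {
  /** True when every start marker is directly followed by its own end marker. */
  static boolean neverInterleaved(List<String> trace) {
    if (trace.size() % 2 != 0) {
      return false;
    }
    for (int i = 0; i < trace.size(); i += 2) {
      String op = trace.get(i).split(":")[0];
      if (!trace.get(i).equals(op + ":start")
          || !trace.get(i + 1).equals(op + ":end")) {
        return false;
      }
    }
    return true;
  }

  /** Runs one move and one remove on separate threads and returns the trace. */
  static List<String> runConcurrently() {
    SketchScheduler scheduler = new SketchScheduler();
    scheduler.addAttempt("attempt_1", "root.a");
    Thread mover = new Thread(
        () -> scheduler.moveApplication("attempt_1", "root.b"));
    Thread remover = new Thread(
        () -> scheduler.removeApplicationAttempt("attempt_1"));
    mover.start();
    remover.start();
    try {
      mover.join();
      remover.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return scheduler.trace();
  }

  public static void main(String[] args) {
    List<String> trace = runConcurrently();
    System.out.println(trace + " interleaved=" + !neverInterleaved(trace));
  }
}
```

The order of the two operations may differ between runs, but under these assumptions a start marker is never separated from its end marker. Serialisation alone does not rule out two remove calls arriving one after the other for the same attempt.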
[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
[ https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848046#comment-17848046 ]

Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds] Any thoughts on this?

> Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the following exception:
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_01 Alloc: does not exist in queue [root.tier2.livy, demand=, running=, share=, w=1.0]
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if calls to moveApplication and removeApplicationAttempt for the same attempt are processed in short succession, the application attempt will still contain a queue reference but is already removed from the list of applications for the queue.
> 2. This can happen when [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908] removes the appAttempt from the queue and [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707] also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before doing [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779] a writeLock on the appAttempt is taken, whereas for [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665] I don't see any write lock being taken, which can result in a race condition if the same appAttempt is being processed.
> 4. Additionally, as mentioned in the comment here, when such a scenario occurs, ideally we should not take down the RM.