subject:"\[jira\] \[Commented\] \(YARN\-11697\) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication"

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848123#comment-17848123
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

You need to figure out why you get two remove events in a row for the same 
application. This code has not change in multiple years. If this was really a 
big issue we should have seen this happen more often and years ago.

Try to reproduce without the backports and see if it still happens. You might 
have backported things that are not compatible that cause side effects.

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Syed Shameerur Rahman (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848119#comment-17848119
 ] 

Syed Shameerur Rahman commented on YARN-11697:
--

[~wilfreds] 
 # IMHO, when the appAttempt is not available in the queue to be removed, It 
should be handled more gracefully than throwing IllegalStateException which 
will take down the RM.
 # Since the appAttempt is anyhow not available in the queue we can safely 
throw warning message instead of throwing exception

 

Any thoughts on the above approach ?

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Syed Shameerur Rahman (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848086#comment-17848086
 ] 

Syed Shameerur Rahman commented on YARN-11697:
--

Additionally i could specifically see this when Application is being killed.

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Syed Shameerur Rahman (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848085#comment-17848085
 ] 

Syed Shameerur Rahman commented on YARN-11697:
--

[~wilfreds] 

I had some custom code/backports from higher version and hence the code lines 
might have differed from the OSS hadoop code base. I could see the following 
exception though 
java.lang.IllegalStateException: Given app to remove 
appattempt_1706879498319_86660_01 Alloc:  does not 
exist in queue [root, demand=, 
running=, share=, w=1.0]
 

So this exception comes only when the appAttempt is already removed from the 
queue and we try to remove it again. Throwing IllegalStateException causes the 
RM to shutdown with exception. Can you think of any scenario this can happen ?

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848072#comment-17848072
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

The stack trace does not correspond to hadoop 3.2.1: 
[FairScheduler.java:757|https://github.com/apache/hadoop/blob/branch-3.2.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L757]

That points to this line in hadoop 3.2.1 which is part of 
completedContainerInternal
{code:java}
755        application.containerCompleted(rmContainer, containerStatus, event);
756        if (node != null) {
757          node.releaseContainer(rmContainer.getContainerId(), false);
758        } else if (LOG.isDebugEnabled()) {
759          LOG.debug("Skipping container release on removed node: " + nodeID);
760        } {code}
The comment in the moveApplication around locking the app attempt are for 
scheduling. An application could be scheduled while being moved and that needs 
to be stopped. The remove of an application attempt takes a write lock on the 
scheduler itself. Same as the move does. So a moveApplication and 
removeApplicationAttempt cannot happen at the same time. they both need that 
lock and are serialised.

I think you are looking at the wrong thing and a move is not involved.

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
>

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-20 Thread Syed Shameerur Rahman (Jira)



[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848046#comment-17848046
 ] 

Syed Shameerur Rahman commented on YARN-11697:
--

[~wilfreds]  any thoughts on this ?

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root.tier2.livy, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there is still some edge cases not covered by YARN-5136.
> 1. On deeper look, i could see that as mentioned in the comment here. if a 
> call for a moveApplication and removeApplicationAttempt for the same attempt 
> are processed in short succession the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

6 matches

Site Navigation

Mail list logo

Footer information