[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-26 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005
 ] 

Prabhu Joseph edited comment on YARN-11501 at 6/26/23 6:19 AM:
---

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState in the trunk code. Any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During the initial analysis, we tried to fix the locking in 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then _RMNode_).

This deadlock issue won't happen without the private change, so I will mark 
this as invalid.
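For context, the standard fix for this class of deadlock is to impose a single global acquisition order on the two locks, so no "held by A, wanted by B" cycle can form. A minimal standalone sketch of that pattern (the class and lock names below are illustrative, not Hadoop's actual code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the lock-ordering fix: every code path that needs both locks
// acquires them in the same global order (the "RMNode" lock before the
// "SchedulerNode" lock), which makes a wait cycle impossible.
public class LockOrderingSketch {
    static final ReentrantReadWriteLock rmNodeLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock schedulerNodeLock = new ReentrantReadWriteLock();

    // Single choke point: both locks, always in the same order.
    static void withBothLocks(Runnable body) {
        rmNodeLock.writeLock().lock();
        try {
            schedulerNodeLock.writeLock().lock();
            try {
                body.run();
            } finally {
                schedulerNodeLock.writeLock().unlock();
            }
        } finally {
            rmNodeLock.writeLock().unlock();
        }
    }

    // Two threads contend for both locks; with a fixed order they always finish.
    static boolean runBoth() throws InterruptedException {
        Thread t1 = new Thread(() -> withBothLocks(() -> {}));
        Thread t2 = new Thread(() -> withBothLocks(() -> {}));
        t1.start();
        t2.start();
        t1.join(5000);
        t2.join(5000);
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth() ? "both finished" : "stuck");
    }
}
```

The reported bug is exactly a violation of this rule: one path took RMNode then SchedulerNode, the other SchedulerNode then RMNode.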



> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in the ResourceManager: 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock 
> on RMNode and waits to lock SchedulerNode, whereas 
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and waits 
> to lock RMNode.
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> 

[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736986#comment-17736986
 ] 

Bibin Chundatt edited comment on YARN-11501 at 6/26/23 5:10 AM:


[~prabhujoseph] Did a quick scan of the call stack:

at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)

I am not able to trace ClusterNodeTracker#updateMaxResources -> 
RMNodeImpl.getState in the trunk code. Any private change?
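The "ownable synchronizer" cycle that jstack reports in the description can also be reproduced in miniature and detected programmatically with ThreadMXBean. A minimal standalone sketch (the lock names are illustrative stand-ins, not Hadoop's actual locks):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.locks.ReentrantLock;

// Reproduces the reported lock-order inversion: thread A takes "RMNode"
// then wants "SchedulerNode"; thread B does the reverse. ThreadMXBean then
// reports the cycle, the same mechanism behind jstack's deadlock section.
public class DeadlockDetectSketch {
    public static boolean reproduceAndDetect() throws InterruptedException {
        ReentrantLock rmNode = new ReentrantLock();        // stands in for the RMNode lock
        ReentrantLock schedulerNode = new ReentrantLock(); // stands in for the SchedulerNode lock

        Thread a = new Thread(() -> {
            rmNode.lock();
            sleep(200);
            schedulerNode.lock(); // blocks forever: held by thread b
        });
        Thread b = new Thread(() -> {
            schedulerNode.lock();
            sleep(200);
            rmNode.lock();        // blocks forever: held by thread a
        });
        a.setDaemon(true);        // daemon threads let the JVM exit despite the deadlock
        b.setDaemon(true);
        a.start();
        b.start();
        Thread.sleep(1000);       // give both threads time to block on each other

        // findDeadlockedThreads covers ownable synchronizers (ReentrantLock,
        // ReentrantReadWriteLock), not just intrinsic monitors.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.findDeadlockedThreads() != null;
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(reproduceAndDetect() ? "deadlock detected" : "no deadlock");
    }
}
```

This kind of check can be wired into a watchdog thread to log deadlocks in a long-running daemon instead of waiting for a manual jstack.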


