[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737005#comment-17737005
 ] 

Prabhu Joseph edited comment on YARN-11501 at 6/26/23 6:19 AM:
---

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState in the trunk code. Is there any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During the initial analysis we tried to fix the locking in 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then _RMNode_).

This deadlock won't happen without the private change, so I will mark this 
issue as invalid.
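
To see the problematic ordering in isolation, below is a minimal, self-contained 
sketch. It is illustrative only, not YARN code: it uses plain ReentrantLocks 
instead of the ReentrantReadWriteLocks the RM actually holds, and the class and 
method names are made up. Path 1 takes the RMNode lock and then the SchedulerNode 
lock; path 2 takes them in the opposite order, which is the cycle reported in the 
jstack below. Making both paths acquire in the same order (for example, reading 
the RMNode state before taking the SchedulerNode lock in path 2) removes the cycle.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only -- not YARN code. It shows the two lock
// acquisition orders described in the comment above.
public class LockOrderingSketch {
  private final ReentrantLock rmNodeLock = new ReentrantLock();
  private final ReentrantLock schedulerNodeLock = new ReentrantLock();

  // Path 1 (analogous to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers):
  // RMNode lock first, then SchedulerNode lock.
  void statusUpdatePath() {
    rmNodeLock.lock();
    try {
      schedulerNodeLock.lock();   // blocks if path 2 already holds it
      try {
        // inspect SchedulerNode state while holding both locks
      } finally {
        schedulerNodeLock.unlock();
      }
    } finally {
      rmNodeLock.unlock();
    }
  }

  // Path 2 (analogous to the private ClusterNodeTracker.updateMaxResources ->
  // RMNodeImpl.getState change): SchedulerNode lock first, then RMNode lock --
  // the reversed order that closes the deadlock cycle.
  void updateMaxResourcesPath() {
    schedulerNodeLock.lock();
    try {
      rmNodeLock.lock();          // blocks if path 1 already holds it
      try {
        // read RMNode state while holding both locks
      } finally {
        rmNodeLock.unlock();
      }
    } finally {
      schedulerNodeLock.unlock();
    }
  }
}
{code}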


> ResourceManager deadlock due to 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
>
> We have seen a deadlock in the ResourceManager: 
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock on 
> RMNode and waits to lock SchedulerNode, while CapacityScheduler#removeNode 
> has taken the lock on SchedulerNode and waits to lock RMNode.
> cc *Vishal Vyas*
>  
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x000717e6ff60> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch

[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737005#comment-17737005
 ] 

Prabhu Joseph commented on YARN-11501:
--

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> 
>> RMNodeImpl.getState in the trunk code. Is there any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. 
During the initial analysis we tried to fix the locking in 
_StatusUpdateWhenHealthyTransition.hasScheduledAMContainers_ (which locks 
_RMNode_ first and then _SchedulerNode_), but we found it easier to fix our 
private change (_ClusterNodeTracker.updateMaxResources_ -> 
_RMNodeImpl.getState_, which locks _SchedulerNode_ first and then _RMNode_).

This deadlock won't happen without the private change, so I will mark this 
issue as invalid.

 

 

 

 

 

 

 


[jira] [Resolved] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-11501.
--
Resolution: Invalid


[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736986#comment-17736986
 ] 

Bibin Chundatt edited comment on YARN-11501 at 6/26/23 5:10 AM:


[~prabhujoseph] I did a quick scan of the call stack.

at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)

I am not able to trace ClusterNodeTracker#updateMaxResources -> 
RMNodeImpl.getState in the trunk code. Is there any private change?


was (Author: bibinchundatt):
[~prabhujoseph] I did a quick scan of the call stack. The stack trace doesn't 
match the one from OSS.

at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)

I am not able to trace ClusterNodeTracker#updateMaxResources -> 
RMNodeImpl.getState in the trunk code. Is there any private change?


[jira] [Commented] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers

2023-06-25 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736986#comment-17736986
 ] 

Bibin Chundatt commented on YARN-11501:
---

[~prabhujoseph] I did a quick scan of the call stack. The stack trace doesn't 
match the one from OSS.

at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)

I am not able to trace ClusterNodeTracker#updateMaxResources -> 
RMNodeImpl.getState in the trunk code. Is there any private change?


[jira] [Commented] (YARN-11517) Improve Federation#RouterCLI deregisterSubCluster Code

2023-06-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736892#comment-17736892
 ] 

ASF GitHub Bot commented on YARN-11517:
---

hadoop-yetus commented on PR #5766:
URL: https://github.com/apache/hadoop/pull/5766#issuecomment-1606151656

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  3s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  17m 47s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 53s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   8m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   2m 14s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 33s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m  5s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 51s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m  6s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   8m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   7m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 52s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 36s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 31s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 34s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  28m 50s |  |  hadoop-yarn-client in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   0m 56s |  |  hadoop-yarn-server-router in 
the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 13s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 181m 23s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5766/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5766 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux fbb7510bd25b 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 6423c5342da06ccbe9f21d0cb158834d3fed06d8 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5766/4/testReport/ |
   | Max. process+thread count | 712 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoo

[jira] [Commented] (YARN-11517) Improve Federation#RouterCLI deregisterSubCluster Code

2023-06-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736868#comment-17736868
 ] 

ASF GitHub Bot commented on YARN-11517:
---

slfan1989 commented on code in PR #5766:
URL: https://github.com/apache/hadoop/pull/5766#discussion_r1241152687


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/src/main/java/org/apache/hadoop/yarn/server/router/rmadmin/FederationRMAdminInterceptor.java:
##
@@ -879,9 +879,11 @@ private DeregisterSubClusters deregisterSubCluster(String 
reqSubClusterId) {
   SubClusterState subClusterState = subClusterInfo.getState();
   long lastHeartBeat = subClusterInfo.getLastHeartBeat();
   Date lastHeartBeatDate = new Date(lastHeartBeat);
-
+  String heartBeatTimeOut =

Review Comment:
   I will modify the code.





> Improve Federation#RouterCLI deregisterSubCluster Code
> --
>
> Key: YARN-11517
> URL: https://issues.apache.org/jira/browse/YARN-11517
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, router
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (YARN-9610) HeartbeatCallBack int FederationInterceptor clear AMRMToken in response from UAM should before add to aysncResponseSink

2023-06-25 Thread Morty Zhong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736840#comment-17736840
 ] 

Morty Zhong commented on YARN-9610:
---

[~walhl] Yes, I compared these two patches. This issue can be closed.

> HeartbeatCallBack int FederationInterceptor clear AMRMToken in response from 
> UAM should before add to aysncResponseSink 
> 
>
> Key: YARN-9610
> URL: https://issues.apache.org/jira/browse/YARN-9610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: amrmproxy, federation
>Affects Versions: 3.2.0
>Reporter: Morty Zhong
>Assignee: Morty Zhong
>Priority: Major
> Attachments: YARN-9610.patch.1, YARN-9610.patch.2
>
>
> In federation, `allocate` is async; the response from each RM is cached in 
> `asyncResponseSink`.
> The final allocate response is merged from the allocate responses of all RMs, 
> and the merge throws an exception when the AMRMToken from a UAM response is 
> not null.
> However, setting the AMRMToken from the UAM response to null is not done 
> within the scope of the lock, so there is a chance that the merge sees a 
> non-null AMRMToken from the UAM response.
> Therefore, we should clear the token before adding the response to 
> asyncResponseSink.
>  
>  
> {code:java}
> synchronized (asyncResponseSink) {
>   List<AllocateResponse> responses = null;
>   if (asyncResponseSink.containsKey(subClusterId)) {
> responses = asyncResponseSink.get(subClusterId);
>   } else {
> responses = new ArrayList<>();
> asyncResponseSink.put(subClusterId, responses);
>   }
>   responses.add(response);
>   // Notify main thread about the response arrival
>   asyncResponseSink.notifyAll();
> }
> ...
> if (this.isUAM && response.getAMRMToken() != null) {
>   Token<AMRMTokenIdentifier> newToken = ConverterUtils
>   .convertFromYarn(response.getAMRMToken(), (Text) null);
>   // Do not further propagate the new amrmToken for UAM
>   response.setAMRMToken(null);
> ...{code}
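
To make the proposed ordering concrete, here is a sketch of how the snippet above 
could be rearranged. It is illustrative only and keeps the fragment form of the 
quoted code; it assumes asyncResponseSink is a Map<SubClusterId, 
List<AllocateResponse>>, as the snippet implies. The point is simply that the 
AMRMToken is cleared from the UAM response before the response is published to 
asyncResponseSink, so the merging thread can never observe a non-null token.

{code:java}
// Illustrative rearrangement of the snippet above (not the actual
// FederationInterceptor code): handle the UAM token first, then publish.
if (this.isUAM && response.getAMRMToken() != null) {
  Token<AMRMTokenIdentifier> newToken = ConverterUtils
      .convertFromYarn(response.getAMRMToken(), (Text) null);
  // ... hand newToken to the UAM bookkeeping as before ...
  // Do not further propagate the new amrmToken for UAM
  response.setAMRMToken(null);
}

// The response becomes visible to the merging thread only after the token
// has been cleared.
synchronized (asyncResponseSink) {
  List<AllocateResponse> responses = null;
  if (asyncResponseSink.containsKey(subClusterId)) {
    responses = asyncResponseSink.get(subClusterId);
  } else {
    responses = new ArrayList<>();
    asyncResponseSink.put(subClusterId, responses);
  }
  responses.add(response);
  // Notify main thread about the response arrival
  asyncResponseSink.notifyAll();
}
{code}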


