[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737005#comment-17737005 ] Prabhu Joseph edited comment on YARN-11501 at 6/26/23 6:19 AM:

>> I am not able to trace ClusterNodeTracker#updateMaxResources -> RMNodeImpl.getState .. in trunk code . Any private change ??

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change. During the initial analysis, we were trying to fix the locking at {_}StatusUpdateWhenHealthyTransition{_}.{_}hasScheduledAMContainers{_} (which locks _RMNode_ first and then {_}SchedulerNode{_}). But we found it easier to fix our private change ({_}ClusterNodeTracker{_}.{_}updateMaxResources{_} -> {_}RMNodeImpl{_}.{_}getState{_}, which locks _SchedulerNode_ first and then {_}RMNode{_}). This deadlock won't happen without the private change, so I will mark this as invalid.
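The fix described above amounts to making both code paths agree on a single lock-acquisition order, so no cycle can form. A minimal sketch of that idea, with hypothetical class and method names standing in for the YARN code paths (the actual private-change fix is not shown in this thread):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ConsistentLockOrder {

    // Hypothetical stand-ins for the RMNode and SchedulerNode locks.
    static final ReentrantReadWriteLock rmNode = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock schedulerNode = new ReentrantReadWriteLock();

    // Path 1 (like StatusUpdateWhenHealthyTransition.hasScheduledAMContainers):
    // locks RMNode first, then SchedulerNode.
    static void statusUpdatePath() {
        rmNode.writeLock().lock();
        try {
            schedulerNode.writeLock().lock();
            try { /* check scheduled AM containers */ }
            finally { schedulerNode.writeLock().unlock(); }
        } finally { rmNode.writeLock().unlock(); }
    }

    // Path 2 (like the private ClusterNodeTracker.updateMaxResources change),
    // reordered so it also takes RMNode first, removing the inversion.
    static void updateMaxResourcesPath() {
        rmNode.writeLock().lock();
        try {
            schedulerNode.writeLock().lock();
            try { /* read node state, update max resources */ }
            finally { schedulerNode.writeLock().unlock(); }
        } finally { rmNode.writeLock().unlock(); }
    }

    /** Runs both paths concurrently; with one global order, both threads finish. */
    public static boolean runConcurrently() throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) statusUpdatePath(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) updateMaxResourcesPath(); });
        t1.start(); t2.start();
        t1.join(5_000); t2.join(5_000);
        // No cycle is possible, so neither thread can still be blocked here.
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed without deadlock: " + runConcurrently());
    }
}
```

This is only an illustration of the ordering discipline, not the actual patch; the other standard remedy is to avoid calling back into `RMNodeImpl` while holding the `SchedulerNode` lock at all.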
> ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> --
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Critical
>
> We have seen a deadlock in ResourceManager because StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holds the lock on RMNode and waits to lock SchedulerNode, whereas CapacityScheduler#removeNode has taken the lock on SchedulerNode and waits to lock RMNode.
> cc *Vishal Vyas*
>
> {code:java}
> Found one Java-level deadlock:
> =
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x0007168a7a38, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
>
> Java stack information for the threads listed above:
> ===
> "qtp1401737458-850":
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x000717e6ff60> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>   at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>   at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>   at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at
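The cycle in the thread dump is a classic lock-order inversion: one thread acquires RMNode then SchedulerNode, the other acquires them in the opposite order. A small self-contained sketch of that pattern (hypothetical lock and thread names; latches force the problematic interleaving, and a timed `tryLock` stands in for the real blocking wait so the demo terminates):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderInversionDemo {

    // Hypothetical stand-ins for the RMNode and SchedulerNode write locks.
    static final ReentrantReadWriteLock rmNodeLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock schedulerNodeLock = new ReentrantReadWriteLock();

    /** Returns true if either thread managed to acquire its second lock. */
    public static boolean runInvertedOrder() throws InterruptedException {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Like hasScheduledAMContainers: RMNode first, then SchedulerNode.
        Thread statusUpdate = worker(rmNodeLock, schedulerNodeLock, gotSecond, 0, bothHoldFirst, bothTried);
        // Like CapacityScheduler#removeNode: SchedulerNode first, then RMNode.
        Thread removeNode = worker(schedulerNodeLock, rmNodeLock, gotSecond, 1, bothHoldFirst, bothTried);

        statusUpdate.start(); removeNode.start();
        statusUpdate.join(); removeNode.join();
        return gotSecond[0] || gotSecond[1];
    }

    private static Thread worker(ReentrantReadWriteLock first, ReentrantReadWriteLock second,
                                 boolean[] out, int idx,
                                 CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        return new Thread(() -> {
            first.writeLock().lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();          // both threads now hold their first lock
                out[idx] = second.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
                if (out[idx]) {
                    second.writeLock().unlock();
                }
                bothTried.countDown();
                bothTried.await();              // keep the first lock until both have tried
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.writeLock().unlock();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        // Neither thread can take its second lock while the other holds it.
        System.out.println("second lock acquired by anyone: " + runInvertedOrder());
    }
}
```

With blocking `lock()` calls instead of `tryLock`, both threads would wait forever, exactly the state the `jstack` output above reports.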
[jira] [Comment Edited] (YARN-11501) ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736986#comment-17736986 ] Bibin Chundatt edited comment on YARN-11501 at 6/26/23 5:10 AM:

[~prabhujoseph] Did a quick scan of the call stack..

at org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)

I am not able to trace ClusterNodeTracker#updateMaxResources -> RMNodeImpl.getState in trunk code. Any private change?