[ https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737005#comment-17737005 ]
Prabhu Joseph commented on YARN-11501:
--------------------------------------

>> I am not able to trace ClusterNodeTracker#updateMaxResources ->
>> RMNodeImpl.getState in trunk code. Any private change?

Thanks, Bibin Chundatt. Yes, you are right; this part is a private change.

During initial analysis, we were trying to fix the locking at StatusUpdateWhenHealthyTransition.hasScheduledAMContainers (which locks RMNode first and then SchedulerNode). But we found it easier to fix our private change (ClusterNodeTracker.updateMaxResources -> RMNodeImpl.getState, which locks SchedulerNode first and then RMNode).

This deadlock won't happen without the private change, so I will mark this issue as invalid.

> ResourceManager deadlock due to StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-11501
>                 URL: https://issues.apache.org/jira/browse/YARN-11501
>             Project: Hadoop YARN
>          Issue Type: Bug
>   Affects Versions: 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Critical
>
> We have seen a deadlock in ResourceManager: StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> holds the lock on RMNode and is waiting to lock SchedulerNode, whereas
> CapacityScheduler#removeNode has taken the lock on SchedulerNode and is waiting to lock
> RMNode.
> cc *Vishal Vyas*
>
> {code:java}
> Found one Java-level deadlock:
> =============================
> "qtp1401737458-850":
>   waiting for ownable synchronizer 0x0000000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> "RM Event dispatcher":
>   waiting for ownable synchronizer 0x00000007168a7a38, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
>   which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
>   waiting for ownable synchronizer 0x0000000717e6ff60, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===================================================
> "qtp1401737458-850":
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x0000000717e6ff60> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>     at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
>     at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
>     at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>     at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>     at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>     at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>     at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>     at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>     at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>     at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>     at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>     at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>     at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>     at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>     at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>     at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>     at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>     - locked <0x0000000791ad1fd0> (a com.google.inject.servlet.GuiceFilter$Context)
>     at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1764)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>     at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>     at org.eclipse.jetty.server.Server.handle(Server.java:516)
>     at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
>     at org.eclipse.jetty.server.HttpChannel$$Lambda$114/1946743342.dispatch(Unknown Source)
>     at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>     at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
>     at java.lang.Thread.run(Thread.java:750)
> "RM Event dispatcher":
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000007168a7a38> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>     at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNode(ClusterNodeTracker.java:135)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getSchedulerNode(AbstractYarnScheduler.java:792)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.hasScheduledAMContainers(RMNodeImpl.java:1427)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1372)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1342)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     - locked <0x0000000717e70928> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:768)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:104)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1267)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1251)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:241)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:156)
>     at java.lang.Thread.run(Thread.java:750)
> "SchedulerEventDispatcher:Event Processor":
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x0000000717e6ff60> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>     at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.lambda$updateMaxResources$0(ClusterNodeTracker.java:337)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker$$Lambda$131/1663466005.accept(Unknown Source)
>     at java.util.HashMap$Values.forEach(HashMap.java:982)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:337)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:220)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
>     at java.lang.Thread.run(Thread.java:750)
> Found 1 deadlock.
> {code}
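
For readers unfamiliar with the pattern, the lock-order inversion discussed in this thread can be reproduced in isolation. The following is a minimal sketch and not YARN code: `NODE_LOCK` and `TRACKER_LOCK` are hypothetical stand-ins for the RMNode lock and the scheduler-side lock, and every class and method name here is invented for illustration. It uses `tryLock` with a timeout so that the inverted ordering shows up as a failed acquisition rather than a hung JVM, and then shows that taking both locks in the same order on both paths lets the two threads finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockOrderSketch {
    // Hypothetical stand-ins for the two locks in the thread dump (assumption, not YARN code).
    static final ReentrantReadWriteLock NODE_LOCK = new ReentrantReadWriteLock();        // nonfair
    static final ReentrantReadWriteLock TRACKER_LOCK = new ReentrantReadWriteLock(true); // fair

    /** Holds first, then tries second with a timeout; returns whether second was acquired. */
    static boolean holdThenTry(Lock first, Lock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothAttempted) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();   // wait until both threads hold their first lock
            boolean acquired = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothAttempted.countDown();
            bothAttempted.await();   // keep first held until both attempts are over
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Takes first, then second, with plain blocking lock() calls. */
    static void lockBoth(Lock first, Lock second) {
        first.lock();
        try {
            second.lock();
            try {
                // Work that needs both locks would go here.
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }

    /** Starts both threads and reports whether both terminated within the join timeout. */
    static boolean runBoth(Thread a, Thread b) {
        a.start();
        b.start();
        try {
            a.join(5000);
            b.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !a.isAlive() && !b.isAlive();
    }

    /** Inverted order: each thread write-locks one lock and wants a read lock on the other. */
    static boolean invertedOrderBlocks() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicBoolean aGot = new AtomicBoolean();
        AtomicBoolean bGot = new AtomicBoolean();
        Thread a = new Thread(() -> aGot.set(holdThenTry(
                NODE_LOCK.writeLock(), TRACKER_LOCK.readLock(), bothHoldFirst, bothAttempted)));
        Thread b = new Thread(() -> bGot.set(holdThenTry(
                TRACKER_LOCK.writeLock(), NODE_LOCK.readLock(), bothHoldFirst, bothAttempted)));
        return runBoth(a, b) && !aGot.get() && !bGot.get();
    }

    /** Consistent order: both paths take NODE_LOCK before TRACKER_LOCK. */
    static boolean consistentOrderCompletes() {
        Thread a = new Thread(() -> lockBoth(NODE_LOCK.writeLock(), TRACKER_LOCK.readLock()));
        Thread b = new Thread(() -> lockBoth(NODE_LOCK.readLock(), TRACKER_LOCK.writeLock()));
        return runBoth(a, b);
    }

    public static void main(String[] args) {
        System.out.println("inverted order blocks: " + invertedOrderBlocks());
        System.out.println("consistent order completes: " + consistentOrderCompletes());
    }
}
```

With plain `lock()` calls in place of `tryLock`, the inverted case would hang in the same shape as the dump above: each thread holds a write lock and parks in `doAcquireShared` waiting for a read lock on the other. Which ordering is appropriate in a real code base depends on its existing locking conventions; the sketch only shows that agreeing on one order removes the cycle.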