[ https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825988#comment-13825988 ]
Omkar Vinit Joshi commented on YARN-1422:
-----------------------------------------

Yes, this looks like a problem. Check this [synchronization locking problem|https://issues.apache.org/jira/browse/YARN-897?focusedCommentId=13706284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13706284]. The lock ordering should always be from root to leaf queue. I think there may be other places too where this ordering is mixed.

> RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a
> container is completing
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1422
>                 URL: https://issues.apache.org/jira/browse/YARN-1422
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Adam Kawa
>            Priority: Critical
>
> If getQueueUserAclInfo() on a parent/root queue (e.g. via
> CapacityScheduler.getQueueUserAclInfo) is called, and a container is
> completing, then the ResourceManager can deadlock.
> It is similar to https://issues.apache.org/jira/browse/YARN-325.
> *More details:*
> * Thread A
> 1) In a synchronized block of code (a lockid
> 0x00000000c18d8870=LeafQueue.class), LeafQueue.completedContainer wants to
> inform the parent queue that a container is being completed and invokes the
> ParentQueue.completedContainer method.
> 3) ParentQueue.completedContainer waits to acquire a lock on itself (a
> lockid 0x00000000c1846350=ParentQueue.class) to enter its synchronized block
> of code. It cannot acquire this lock, because Thread B already holds it.
> * Thread B
> 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This
> method invokes a synchronized method on ParentQueue.class, i.e.
> ParentQueue.getQueueUserAclInfo (a lockid
> 0x00000000c1846350=ParentQueue.class), and acquires the lock that Thread A
> will be waiting for.
> 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the child queues'
> ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo,
> but it does not hold the lock on LeafQueue.class (a lockid
> 0x00000000c18d8870=LeafQueue.class). This lock is already held by
> LeafQueue.completedContainer in Thread A.
> The order that causes the deadlock: B0 -> A1 -> B2 -> A3.
> *Java Stacktrace*
> {code}
> Found one Java-level deadlock:
> =============================
> "1956747953@qtp-109760451-1959":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> "IPC Server handler 39 on 8032":
>   waiting to lock monitor 0x00000000422bbc58 (object 0x00000000c18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
>   which is held by "ResourceManager Event Processor"
> "ResourceManager Event Processor":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> Java stack information for the threads listed above:
> ===================================================
> "1956747953@qtp-109760451-1959":
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
>       - waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
>       at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
>       at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
>       at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
>       at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>       at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>       at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
>       at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
>       at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
>       at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>       at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>       at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
>       at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>       at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>       at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>       at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>       at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>       at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>       at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>       at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>       at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>       at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>       at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>       at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>       at org.mortbay.jetty.Server.handle(Server.java:326)
>       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>       at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>       at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>       at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> "IPC Server handler 39 on 8032":
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueUserAclInfo(LeafQueue.java:544)
>       - waiting to lock <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueUserAclInfo(ParentQueue.java:351)
>       - locked <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueUserAclInfo(CapacityScheduler.java:622)
>       at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:517)
>       at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:225)
>       at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:255)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
> "ResourceManager Event Processor":
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:693)
>       - waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1460)
>       - locked <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:838)
>       - locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:648)
>       - locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>       at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>       at java.lang.Thread.run(Thread.java:662)
> Found 1 deadlock.
> {code}

--
This message was sent by Atlassian JIRA (v6.1#6144)
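For illustration only, the root-to-leaf ordering suggested in the comment could look roughly like the sketch below. It is not the CapacityScheduler code and not a proposed patch: ToyParent, ToyLeaf, countAclEntries, releaseContainer and runDemo are hypothetical names, and the toy model assumes a single parent with one leaf. The point it tries to show is that when the completion path takes the parent monitor before the leaf monitor, it uses the same order as the ACL read path, so the B0 -> A1 -> B2 -> A3 interleaving from the report cannot form a cycle.

```java
import java.util.ArrayList;
import java.util.List;

public class RootToLeafLocking {

    /** Hypothetical stand-in for a parent queue; not the real ParentQueue. */
    static class ToyParent {
        final List<ToyLeaf> children = new ArrayList<>();
        int usedContainers;

        // Read path: takes the parent monitor first, then each child's monitor.
        synchronized int countAclEntries() {
            int total = 1;
            for (ToyLeaf child : children) {
                total += child.countAclEntries();
            }
            return total;
        }

        synchronized void releaseContainer() {
            usedContainers--;
        }
    }

    /** Hypothetical stand-in for a leaf queue; not the real LeafQueue. */
    static class ToyLeaf {
        final ToyParent parent;
        int usedContainers;

        ToyLeaf(ToyParent parent) {
            this.parent = parent;
            parent.children.add(this);
        }

        synchronized int countAclEntries() {
            return 1;
        }

        // Completion path. The deadlock-prone shape in the report holds the leaf
        // monitor and then asks for the parent monitor. Here the parent monitor
        // is taken first, so both paths lock in root-to-leaf order.
        void completedContainer() {
            synchronized (parent) {
                synchronized (this) {
                    usedContainers--;
                }
                parent.releaseContainer(); // monitors are reentrant, so this is safe
            }
        }
    }

    /** Runs both paths concurrently; returns true if both threads finish in time. */
    public static boolean runDemo(int iterations) {
        ToyParent root = new ToyParent();
        ToyLeaf leaf = new ToyLeaf(root);
        root.usedContainers = iterations;
        leaf.usedContainers = iterations;

        Thread completer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                leaf.completedContainer();
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                root.countAclEntries();
            }
        });
        completer.setDaemon(true);
        reader.setDaemon(true);
        completer.start();
        reader.start();
        try {
            completer.join(10_000);
            reader.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !completer.isAlive() && !reader.isAlive()
                && root.usedContainers == 0 && leaf.usedContainers == 0;
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runDemo(10_000));
    }
}
```

An alternative with the same effect would be for the leaf to release its own monitor before calling up into the parent; which of the two fits the real scheduler depends on invariants this sketch does not model.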