Jon Bender created YARN-10221:
---------------------------------

Summary: Nodemanager lockups on printEventQueueDetails
Key: YARN-10221
URL: https://issues.apache.org/jira/browse/YARN-10221
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.1
Environment: We're running stock hadoop3.2.1 with cgroups / LinuxContainerExecutor.
Java version:
{noformat}
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
{noformat}
Reporter: Jon Bender

We are seeing a rare but critical bug on our production clusters running Hadoop 3.2.1.

The central issue is that the NodeManager locks up while trying to print details about the event queues. This feature was added in YARN-8995.

The main symptoms are:
- Containers stuck in an Initing phase (ContainersIniting in JMX)
- NM stops accepting RPC calls

Failed job submissions manifest as socket timeouts to the RPC port:
{code}
INFO - diagnostics: Application application_1585693823779_0028 failed 1 times (global limit =2; local limit is =1) due to Error launching appattempt_1585693823779_0028_000001. Got exception: java.net.SocketTimeoutException: Call From hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
{code}

Relevant outputs from {{jstack -l}} on an affected NodeManager.
All IPC threads are blocked waiting on the lock on the eventQueue.

Thread printing event queue details - this runs indefinitely
{code:java}
"Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
        - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)

   Locked ownable synchronizers:
        - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
{code}

Sample IPC handler thread (8039 is our NM RPC port).
All threads waiting on 0x00007f48f5a7a9a8
{code:java}
"IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
        at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
        at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)

   Locked ownable synchronizers:
        - None
{code}

Single thread waiting on 0x00007f48f5a7a950
{code:java}
"NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None
{code}
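
The "Public Localizer" dump shows printEventQueueDetails collecting a stream directly over the live LinkedBlockingQueue while holding the two ReentrantLocks that the put() and take() callers are parked on. As a rough illustration of one possible way to avoid traversing the live queue, the sketch below counts events from a toArray() snapshot instead. This is not the Hadoop code and not necessarily the fix for this issue: the class EventQueueSnapshot, the method countByType, and the string event names are all hypothetical, and it assumes an approximate point-in-time count is good enough for logging.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of Hadoop. */
public class EventQueueSnapshot {

    /**
     * Counts queued elements per type using a copy of the queue contents.
     * toArray() returns a private array, so the grouping loop below never
     * touches the live queue or its internal locks.
     */
    public static <E, K extends Comparable<K>> Map<K, Long> countByType(
            BlockingQueue<E> queue, Function<E, K> typeOf) {
        Object[] snapshot = queue.toArray();
        Map<K, Long> counts = new TreeMap<>();
        for (Object o : snapshot) {
            @SuppressWarnings("unchecked")
            E event = (E) o;
            // Increment the counter for this element's type.
            counts.merge(typeOf.apply(event), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Plain strings stand in for event types (an assumption of this sketch).
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("KILL_CONTAINER");
        queue.add("RESOURCE_LOCALIZED");
        queue.add("KILL_CONTAINER");

        countByType(queue, Function.identity()).forEach((type, n) ->
                System.out.println("Event type: " + type + ", Event record counter: " + n));
    }
}
```

The snapshot costs one array allocation per call, so it would only make sense where the queue details are printed occasionally rather than on every enqueue.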