Jon Bender created YARN-10221:
---------------------------------

             Summary: Nodemanager lockups on printEventQueueDetails
                 Key: YARN-10221
                 URL: https://issues.apache.org/jira/browse/YARN-10221
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.1
         Environment: We're running stock Hadoop 3.2.1 with cgroups / 
LinuxContainerExecutor.

Java version:
{noformat}
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
{noformat}
 
            Reporter: Jon Bender


We are seeing a rare but critical bug on our production clusters running 
Hadoop 3.2.1. The NodeManager locks up while trying to print details about the 
event queues, a feature that was added in YARN-8995.

The main symptoms are:
- Containers stuck in an Initing phase (ContainersIniting in jmx)
- NM stops accepting RPC calls

Failed job submissions manifest as socket timeouts to the RPC port:

{code}
INFO - diagnostics: Application application_1585693823779_0028 failed 1 times 
(global limit =2; local limit is =1) due to Error launching 
appattempt_1585693823779_0028_000001. Got exception: 
java.net.SocketTimeoutException: Call From 
hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to 
hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout 
exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 
remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For 
more details see:  http://wiki.apache.org/hadoop/SocketTimeout
{code}

Relevant output from {{jstack -l}} on an affected NodeManager. All IPC 
threads are blocked waiting on a lock on the eventQueue.

Thread printing event queue details - this runs indefinitely
{code:java}
"Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
        - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)

   Locked ownable synchronizers:
        - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
{code}

Sample IPC handler thread (8039 is our NM RPC port). All IPC handler threads 
are waiting on 0x00007f48f5a7a9a8, which the Public Localizer thread holds.

{code:java}
"IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
        at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
        at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)

   Locked ownable synchronizers:
        - None
{code}
 

Single thread waiting on 0x00007f48f5a7a950, which is also held by the Public 
Localizer thread:
{code:java}
"NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None
{code}
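
The lock ownership above was worked out by hand, matching each "parking to wait for" address against the "Locked ownable synchronizers" list of another thread. A minimal sketch of automating that correlation follows. It is not part of Hadoop or the JDK: the class name JstackLockReport and the method blockedBy are hypothetical, and it assumes a {{jstack -l}} dump saved to a file with one frame per line (unwrapped, unlike the mail-wrapped excerpts in this report).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups parked threads under the thread that owns their lock.
public class JstackLockReport {
    // A thread entry starts with its quoted name at column 0.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");
    // Lock a thread is parked on.
    private static final Pattern WAITING =
        Pattern.compile("parking to wait for\\s+<(0x[0-9a-f]+)>");
    // Entry under "Locked ownable synchronizers" ("- locked <...>" monitor lines do not match).
    private static final Pattern OWNED = Pattern.compile("^\\s*- <(0x[0-9a-f]+)>");

    /** Returns owner thread name -> names of threads parked on a lock it holds. */
    public static Map<String, List<String>> blockedBy(String dump) {
        Map<String, String> ownerOf = new LinkedHashMap<>();         // lock -> owner
        Map<String, List<String>> waitersOf = new LinkedHashMap<>(); // lock -> waiters
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher m;
            if ((m = HEADER.matcher(line)).find()) {
                current = m.group(1);
            } else if (current == null) {
                continue; // preamble before the first thread entry
            } else if ((m = WAITING.matcher(line)).find()) {
                waitersOf.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(current);
            } else if ((m = OWNED.matcher(line)).find()) {
                ownerOf.put(m.group(1), current);
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : waitersOf.entrySet()) {
            String owner = ownerOf.get(e.getKey());
            if (owner != null) { // locks with no listed owner are skipped
                result.computeIfAbsent(owner, k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        blockedBy(dump).forEach((owner, waiters) ->
            System.out.println(owner + " blocks " + waiters));
    }
}
```

Run against a dump like the ones above, this should list the IPC handler threads and the dispatcher thread under "Public Localizer", since that thread owns both 0x00007f48f5a7a950 and 0x00007f48f5a7a9a8.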





