Hey folks, encountering a strange problem. With a certain Tez job we end up with this situation:
- AppMaster starts, launches a container (seen in logs) - NodeManager spawns the container (seen in logs) - nothing happens anymore, the AppMaster just sits an waits there forever - the YARN UI shows 0 running container - resource-manager and node-manager do not have any trace from the container except the it traversed to running state - 2 container JVMS are running: [root@cmaster ~]# ps -ef | grep TezChild root 6144 5547 0 18:14 pts/2 00:00:00 grep TezChild yarn 26964 28254 0 16:11 ? 00:00:00 /bin/bash -c /usr/local/sun-jdk/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true -Djava.awt.headless=true -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_1412235368760_0122/container_1412235368760_0122_01_000002/tmp org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 container_1412235368760_0122_01_000002 application_1412235368760_0122 1 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002/stdout 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002/stderr yarn 27046 26964 1 16:11 ? 00:01:49 /usr/local/sun-jdk/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true -Djava.awt.headless=true -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_1412235368760_0122/container_1412235368760_0122_01_000002/tmp org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 container_1412235368760_0122_01_000002 application_1412235368760_0122 1 [root@cmaster ~]# jps 28254 NodeManager 13597 -- process information unavailable 5757 -- process information unavailable 28723 start.jar 17192 HMaster 27261 JournalNode 27046 TezChild 6154 Jps 17043 QuorumPeerMain 27418 NameNode 28057 JobHistoryServer 27716 ResourceManager 27094 DataNode Looks like there is one TezChild which is a child process from the NodeManager. We cannot get any thread dump from this one. And the 2nd TezChild is a child process from the first TezChild. Find the thread dump of the 2nd TezChild at the bottom of this mail. Any idea whats going on here ? best Johannes ------------------------------------- Attaching to process ID 27046, please wait... Debugger attached successfully. Server compiler detected. JVM version is 24.55-b03 Deadlock Detection: No deadlocks found. Thread 4399: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise) - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame) - java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174, line=460 (Compiled frame) - java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Object, boolean, long) @bci=102, line=359 (Compiled frame) - java.util.concurrent.SynchronousQueue.poll(long, java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27089: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27088: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27087: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27086: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27084: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCache) @bci=1, line=41 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27082: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27080: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=71 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27079: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, line=502 (Interpreted frame) Thread 27077: (state = IN_NATIVE) - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted frame) - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 (Interpreted frame) - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, line=457 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27076: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, line=668 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27071: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise) - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled frame) - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=61, line=183 (Compiled frame) - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=1, line=118 (Interpreted frame) - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27065: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 (Compiled frame) - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 (Compiled frame) Thread 27064: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - some user code Thread 27063: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame) - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame) Thread 27055: (state = BLOCKED) Thread 27054: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled frame) - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame) - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 (Compiled frame) Thread 27053: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame) - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 (Interpreted frame) Thread 27047: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, line=425 (Interpreted frame) - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted frame) - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 (Interpreted frame) - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 (Interpreted frame) - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, line=433 (Interpreted frame)
