Not sure whats happening. Does this always repro? TezChild does not spawn another process. It runs tasks in threads inside its own process. Could be a cluster issue too since YARN is not reporting the containers correctly.
Is this a mini cluster? It could be that the test execution shutdown the minicluster but left the AM and tasks orphaned. Bikas -----Original Message----- From: Johannes Zillmann [mailto:[email protected]] Sent: Thursday, October 02, 2014 10:28 AM To: [email protected] Subject: 2 processes for the same container both not registered in YARN UI Hey folks, encountering a strange problem. With a certain Tez job we end up with this situation: - AppMaster starts, launches a container (seen in logs) - NodeManager spawns the container (seen in logs) - nothing happens anymore, the AppMaster just sits an waits there forever - the YARN UI shows 0 running container - resource-manager and node-manager do not have any trace from the container except the it traversed to running state - 2 container JVMS are running: [root@cmaster ~]# ps -ef | grep TezChild root 6144 5547 0 18:14 pts/2 00:00:00 grep TezChild yarn 26964 28254 0 16:11 ? 00:00:00 /bin/bash -c /usr/local/sun-jdk/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true -Djava.awt.headless=true -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 5368760_0122/container_1412235368760_0122_01_000002 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 2235368760_0122/container_1412235368760_0122_01_000002/tmp org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 container_1412235368760_0122_01_000002 application_1412235368760_0122 1 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 235368760_0122_01_000002/stdout 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 235368760_0122_01_000002/stderr yarn 27046 26964 1 16:11 ? 00:01:49 /usr/local/sun-jdk/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true -Djava.awt.headless=true -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 5368760_0122/container_1412235368760_0122_01_000002 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 2235368760_0122/container_1412235368760_0122_01_000002/tmp org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 container_1412235368760_0122_01_000002 application_1412235368760_0122 1 [root@cmaster ~]# jps 28254 NodeManager 13597 -- process information unavailable 5757 -- process information unavailable 28723 start.jar 17192 HMaster 27261 JournalNode 27046 TezChild 6154 Jps 17043 QuorumPeerMain 27418 NameNode 28057 JobHistoryServer 27716 ResourceManager 27094 DataNode Looks like there is one TezChild which is a child process from the NodeManager. We cannot get any thread dump from this one. And the 2nd TezChild is a child process from the first TezChild. Find the thread dump of the 2nd TezChild at the bottom of this mail. Any idea whats going on here ? best Johannes ------------------------------------- Attaching to process ID 27046, please wait... Debugger attached successfully. Server compiler detected. JVM version is 24.55-b03 Deadlock Detection: No deadlocks found. Thread 4399: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise) - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame) - java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174, line=460 (Compiled frame) - java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj ect, boolean, long) @bci=102, line=359 (Compiled frame) - java.util.concurrent.SynchronousQueue.poll(long, java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27089: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27088: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27087: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27086: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27084: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac he) @bci=1, line=41 (Interpreted frame) - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27082: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27080: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas eRenewer, int) @bci=2, line=71 (Interpreted frame) - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27079: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, line=502 (Interpreted frame) Thread 27077: (state = IN_NATIVE) - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted frame) - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 (Interpreted frame) - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, line=457 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27076: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, line=668 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27071: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise) - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled frame) - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=61, line=183 (Compiled frame) - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=1, line=118 (Interpreted frame) - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) Thread 27065: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 (Compiled frame) - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 (Compiled frame) Thread 27064: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai t() @bci=42, line=2043 (Interpreted frame) - some user code Thread 27063: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame) - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame) Thread 27055: (state = BLOCKED) Thread 27054: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled frame) - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame) - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 (Compiled frame) Thread 27053: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame) - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 (Interpreted frame) Thread 27047: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, line=425 (Interpreted frame) - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted frame) - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 (Interpreted frame) - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 (Interpreted frame) - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, line=433 (Interpreted frame) -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
