Not sure what's happening. Does this always repro? TezChild does not spawn
another process; it runs tasks in threads inside its own process. Could be
a cluster issue too, since YARN is not reporting the containers correctly.

Is this a mini cluster? It could be that the test execution shut down the
minicluster but left the AM and tasks orphaned.
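
In the ps listing quoted below, the first entry's command line starts with
/bin/bash -c and the second entry's parent PID is that shell. A rough sketch
for checking this from ps -ef output follows. The sample lines are
abbreviated from that listing, and the helper names parse_ps and
shell_wrappers are made up for illustration, not part of Tez or YARN.

```python
# Sketch: read `ps -ef` output and report which entries are shell launchers
# that have a child process in the same listing.
# SAMPLE_PS is abbreviated from the listing in this thread (most JVM flags
# and the stdout/stderr redirects are dropped for brevity).
SAMPLE_PS = """\
yarn     26964 28254  0 16:11 ?        00:00:00 /bin/bash -c /usr/local/sun-jdk/bin/java -Xmx819m org.apache.tez.runtime.task.TezChild
yarn     27046 26964  1 16:11 ?        00:01:49 /usr/local/sun-jdk/bin/java -Xmx819m org.apache.tez.runtime.task.TezChild
"""


def parse_ps(text):
    """Return {pid: (ppid, executable)} for each `ps -ef` row."""
    procs = {}
    for line in text.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD...
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header line or malformed rows
        pid, ppid = int(fields[1]), int(fields[2])
        executable = fields[7].split()[0]
        procs[pid] = (ppid, executable)
    return procs


def shell_wrappers(procs):
    """PIDs whose executable is a shell and that are the parent of a listed row."""
    parents = {ppid for ppid, _ in procs.values()}
    return sorted(
        pid for pid, (_, exe) in procs.items()
        if exe.endswith(("/bash", "/sh")) and pid in parents
    )


procs = parse_ps(SAMPLE_PS)
for pid in sorted(procs):
    ppid, exe = procs[pid]
    print(pid, "parent", ppid, "executable", exe)
print("shell wrappers:", shell_wrappers(procs))
```

On a live node the same functions could be fed the real output of
ps -ef instead of the sample string.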

Bikas

-----Original Message-----
From: Johannes Zillmann [mailto:[email protected]]
Sent: Thursday, October 02, 2014 10:28 AM
To: [email protected]
Subject: 2 processes for the same container both not registered in YARN UI

Hey folks,

we are encountering a strange problem. With a certain Tez job we end up in
this situation:

- AppMaster starts, launches a container (seen in logs)
- NodeManager spawns the container (seen in logs)
- nothing happens anymore, the AppMaster just sits and waits there forever
- the YARN UI shows 0 running containers
- resource-manager and node-manager do not have any trace of the
container except that it transitioned to the running state
- 2 container JVMs are running:

[root@cmaster ~]# ps -ef | grep TezChild
root      6144  5547  0 18:14 pts/2    00:00:00 grep TezChild
yarn     26964 28254  0 16:11 ?        00:00:00 /bin/bash -c
/usr/local/sun-jdk/bin/java  -Xmx819m -server
-Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
-XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
-Dhadoop.metrics.log.level=WARN  -Dapple.awt.UIElement=true
-Djava.awt.headless=true
-Dlog4j.configuration=tez-container-log4j.properties
-Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
5368760_0122/container_1412235368760_0122_01_000002
-Dtez.root.logger=INFO,CLA
-Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
2235368760_0122/container_1412235368760_0122_01_000002/tmp
org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
container_1412235368760_0122_01_000002 application_1412235368760_0122 1
1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
235368760_0122_01_000002/stdout
2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
235368760_0122_01_000002/stderr
yarn     27046 26964  1 16:11 ?        00:01:49
/usr/local/sun-jdk/bin/java -Xmx819m -server
-Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
-XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
-Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true
-Djava.awt.headless=true
-Dlog4j.configuration=tez-container-log4j.properties
-Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
5368760_0122/container_1412235368760_0122_01_000002
-Dtez.root.logger=INFO,CLA
-Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
2235368760_0122/container_1412235368760_0122_01_000002/tmp
org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
container_1412235368760_0122_01_000002 application_1412235368760_0122 1
[root@cmaster ~]# jps
28254 NodeManager
13597 -- process information unavailable
5757 -- process information unavailable
28723 start.jar
17192 HMaster
27261 JournalNode
27046 TezChild
6154 Jps
17043 QuorumPeerMain
27418 NameNode
28057 JobHistoryServer
27716 ResourceManager
27094 DataNode

Looks like there is one TezChild which is a child process of the
NodeManager. We cannot get any thread dump from this one.
The 2nd TezChild is a child process of the first TezChild.

Find the thread dump of the 2nd TezChild at the bottom of this mail.
Any idea what's going on here?

best
Johannes

-------------------------------------
Attaching to process ID 27046, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.55-b03
Deadlock Detection:

No deadlocks found.

Thread 4399: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
long) @bci=20, line=226 (Compiled frame)
 -
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util
.concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174,
line=460 (Compiled frame)
 -
java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj
ect, boolean, long) @bci=102, line=359 (Compiled frame)
 - java.util.concurrent.SynchronousQueue.poll(long,
java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068
(Compiled frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27089: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
(Interpreted frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27088: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
(Interpreted frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27087: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
(Interpreted frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27086: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
(Interpreted frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27084: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted
frame)
 -
org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac
he) @bci=1, line=41 (Interpreted frame)
 - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted
frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27082: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
(Interpreted frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27080: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438
(Interpreted frame)
 -
org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas
eRenewer, int) @bci=2, line=71 (Interpreted frame)
 - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27079: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265,
line=502 (Interpreted frame)


Thread 27077: (state = IN_NATIVE)
 - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int,
org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted
frame)
 - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int,
org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52
(Interpreted frame)
 - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551,
line=457 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27076: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
(Interpreted frame)
 - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26,
line=668 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27071: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
long) @bci=20, line=226 (Compiled frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled
frame)
 - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
@bci=61, line=183 (Compiled frame)
 - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
@bci=1, line=118 (Interpreted frame)
 - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted
frame)
 -
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27065: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902
(Compiled frame)
 - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947
(Compiled frame)


Thread 27064: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
t() @bci=42, line=2043 (Interpreted frame)
 - some user code


Thread 27063: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame)
 - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame)


Thread 27055: (state = BLOCKED)


Thread 27054: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled
frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189
(Compiled frame)


Thread 27053: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133
(Interpreted frame)


Thread 27047: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=186 (Interpreted frame)
 - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165,
line=425 (Interpreted frame)
 - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted
frame)
 - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96
(Interpreted frame)
 - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218
(Interpreted frame)
 - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116,
line=433 (Interpreted frame)
