Hi Johannes,

To clarify,  are you concerned about the 2 processes with pids 26964 and 27046 
? If yes, the first one is the bash script invoked by the YARN NodeManager to 
launch the actual TezChild JVM. 

— HItesh

On Oct 13, 2014, at 3:18 AM, Johannes Zillmann <[email protected]> wrote:

> Hey Bikas,
> 
> this was reproducible but the hang causes by some business logic in one of 
> our tasks waiting forever on something.
> How the task jvm's were arranged looked strange but i not investigate further 
> since fixing our business logic fixed the issue!
> 
> thanks
> Johannes
> 
> On 03 Oct 2014, at 02:39, Bikas Saha <[email protected]> wrote:
> 
>> Not sure whats happening. Does this always repro? TezChild does not spawn
>> another process. It runs tasks in threads inside its own process. Could be
>> a cluster issue too since YARN is not reporting the containers correctly.
>> 
>> Is this a mini cluster? It could be that the test execution shutdown the
>> minicluster but left the AM and tasks orphaned.
>> 
>> Bikas
>> 
>> -----Original Message-----
>> From: Johannes Zillmann [mailto:[email protected]]
>> Sent: Thursday, October 02, 2014 10:28 AM
>> To: [email protected]
>> Subject: 2 processes for the same container both not registered in YARN UI
>> 
>> Hey folks,
>> 
>> encountering a strange problem. With a certain Tez job we end up with this
>> situation:
>> 
>> - AppMaster starts, launches a container (seen in logs)
>> - NodeManager spawns the container (seen in logs)
>> - nothing happens anymore, the AppMaster just sits an waits there forever
>> - the YARN UI shows 0 running container
>> - resource-manager and node-manager do not have any trace from the
>> container except the it traversed to running state
>> - 2 container JVMS are running:
>> 
>> [root@cmaster ~]# ps -ef | grep TezChild
>> root      6144  5547  0 18:14 pts/2    00:00:00 grep TezChild
>> yarn     26964 28254  0 16:11 ?        00:00:00 /bin/bash -c
>> /usr/local/sun-jdk/bin/java  -Xmx819m -server
>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
>> -Dhadoop.metrics.log.level=WARN  -Dapple.awt.UIElement=true
>> -Djava.awt.headless=true
>> -Dlog4j.configuration=tez-container-log4j.properties
>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
>> 5368760_0122/container_1412235368760_0122_01_000002
>> -Dtez.root.logger=INFO,CLA
>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp
>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1
>> 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
>> 235368760_0122_01_000002/stdout
>> 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
>> 235368760_0122_01_000002/stderr
>> yarn     27046 26964  1 16:11 ?        00:01:49
>> /usr/local/sun-jdk/bin/java -Xmx819m -server
>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
>> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true
>> -Djava.awt.headless=true
>> -Dlog4j.configuration=tez-container-log4j.properties
>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
>> 5368760_0122/container_1412235368760_0122_01_000002
>> -Dtez.root.logger=INFO,CLA
>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp
>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1
>> [root@cmaster ~]# jps
>> 28254 NodeManager
>> 13597 -- process information unavailable
>> 5757 -- process information unavailable
>> 28723 start.jar
>> 17192 HMaster
>> 27261 JournalNode
>> 27046 TezChild
>> 6154 Jps
>> 17043 QuorumPeerMain
>> 27418 NameNode
>> 28057 JobHistoryServer
>> 27716 ResourceManager
>> 27094 DataNode
>> 
>> Looks like there is one TezChild which is a child process from the
>> NodeManager. We cannot get any thread dump from this one.
>> And the 2nd TezChild is a child process from the first TezChild.
>> 
>> Find the thread dump of the 2nd TezChild at the bottom of this mail.
>> Any idea whats going on here ?
>> 
>> best
>> Johannes
>> 
>> -------------------------------------
>> Attaching to process ID 27046, please wait...
>> Debugger attached successfully.
>> Server compiler detected.
>> JVM version is 24.55-b03
>> Deadlock Detection:
>> 
>> No deadlocks found.
>> 
>> Thread 4399: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
>> may be imprecise)
>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
>> long) @bci=20, line=226 (Compiled frame)
>> -
>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util
>> .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174,
>> line=460 (Compiled frame)
>> -
>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj
>> ect, boolean, long) @bci=102, line=359 (Compiled frame)
>> - java.util.concurrent.SynchronousQueue.poll(long,
>> java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068
>> (Compiled frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27089: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>> (Interpreted frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27088: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>> (Interpreted frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27087: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>> (Interpreted frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27086: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>> (Interpreted frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27084: (state = BLOCKED)
>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
>> - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted
>> frame)
>> -
>> org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac
>> he) @bci=1, line=41 (Interpreted frame)
>> - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted
>> frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27082: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>> (Interpreted frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27080: (state = BLOCKED)
>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
>> - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438
>> (Interpreted frame)
>> -
>> org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas
>> eRenewer, int) @bci=2, line=71 (Interpreted frame)
>> - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27079: (state = BLOCKED)
>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>> imprecise)
>> - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265,
>> line=502 (Interpreted frame)
>> 
>> 
>> Thread 27077: (state = IN_NATIVE)
>> - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int,
>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted
>> frame)
>> - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int,
>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52
>> (Interpreted frame)
>> - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551,
>> line=457 (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27076: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>> (Interpreted frame)
>> - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26,
>> line=668 (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27071: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
>> may be imprecise)
>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
>> long) @bci=20, line=226 (Compiled frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled
>> frame)
>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
>> @bci=61, line=183 (Compiled frame)
>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
>> @bci=1, line=118 (Interpreted frame)
>> - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted
>> frame)
>> -
>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>> eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>> (Interpreted frame)
>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>> 
>> 
>> Thread 27065: (state = BLOCKED)
>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>> imprecise)
>> - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902
>> (Compiled frame)
>> - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947
>> (Compiled frame)
>> 
>> 
>> Thread 27064: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> -
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>> t() @bci=42, line=2043 (Interpreted frame)
>> - some user code
>> 
>> 
>> Thread 27063: (state = BLOCKED)
>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>> imprecise)
>> - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame)
>> - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame)
>> 
>> 
>> Thread 27055: (state = BLOCKED)
>> 
>> 
>> Thread 27054: (state = BLOCKED)
>> - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
>> - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled
>> frame)
>> - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame)
>> - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189
>> (Compiled frame)
>> 
>> 
>> Thread 27053: (state = BLOCKED)
>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>> imprecise)
>> - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame)
>> - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133
>> (Interpreted frame)
>> 
>> 
>> Thread 27047: (state = BLOCKED)
>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>> line=186 (Interpreted frame)
>> - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165,
>> line=425 (Interpreted frame)
>> - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted
>> frame)
>> - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96
>> (Interpreted frame)
>> - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218
>> (Interpreted frame)
>> - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116,
>> line=433 (Interpreted frame)
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to 
>> which it is addressed and may contain information that is confidential, 
>> privileged and exempt from disclosure under applicable law. If the reader 
>> of this message is not the intended recipient, you are hereby notified that 
>> any printing, copying, dissemination, distribution, disclosure or 
>> forwarding of this communication is strictly prohibited. If you have 
>> received this communication in error, please contact the sender immediately 
>> and delete it from your system. Thank You.
> 

Reply via email to