Ok, thanks, thats good to know!

Johannes

On 13 Oct 2014, at 20:18, Hitesh Shah <[email protected]> wrote:

> Hi Johannes,
> 
> To clarify,  are you concerned about the 2 processes with pids 26964 and 
> 27046 ? If yes, the first one is the bash script invoked by the YARN 
> NodeManager to launch the actual TezChild JVM. 
> 
> — HItesh
> 
> On Oct 13, 2014, at 3:18 AM, Johannes Zillmann <[email protected]> 
> wrote:
> 
>> Hey Bikas,
>> 
>> this was reproducible but the hang causes by some business logic in one of 
>> our tasks waiting forever on something.
>> How the task jvm's were arranged looked strange but i not investigate 
>> further since fixing our business logic fixed the issue!
>> 
>> thanks
>> Johannes
>> 
>> On 03 Oct 2014, at 02:39, Bikas Saha <[email protected]> wrote:
>> 
>>> Not sure whats happening. Does this always repro? TezChild does not spawn
>>> another process. It runs tasks in threads inside its own process. Could be
>>> a cluster issue too since YARN is not reporting the containers correctly.
>>> 
>>> Is this a mini cluster? It could be that the test execution shutdown the
>>> minicluster but left the AM and tasks orphaned.
>>> 
>>> Bikas
>>> 
>>> -----Original Message-----
>>> From: Johannes Zillmann [mailto:[email protected]]
>>> Sent: Thursday, October 02, 2014 10:28 AM
>>> To: [email protected]
>>> Subject: 2 processes for the same container both not registered in YARN UI
>>> 
>>> Hey folks,
>>> 
>>> encountering a strange problem. With a certain Tez job we end up with this
>>> situation:
>>> 
>>> - AppMaster starts, launches a container (seen in logs)
>>> - NodeManager spawns the container (seen in logs)
>>> - nothing happens anymore, the AppMaster just sits an waits there forever
>>> - the YARN UI shows 0 running container
>>> - resource-manager and node-manager do not have any trace from the
>>> container except the it traversed to running state
>>> - 2 container JVMS are running:
>>> 
>>> [root@cmaster ~]# ps -ef | grep TezChild
>>> root      6144  5547  0 18:14 pts/2    00:00:00 grep TezChild
>>> yarn     26964 28254  0 16:11 ?        00:00:00 /bin/bash -c
>>> /usr/local/sun-jdk/bin/java  -Xmx819m -server
>>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
>>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
>>> -Dhadoop.metrics.log.level=WARN  -Dapple.awt.UIElement=true
>>> -Djava.awt.headless=true
>>> -Dlog4j.configuration=tez-container-log4j.properties
>>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
>>> 5368760_0122/container_1412235368760_0122_01_000002
>>> -Dtez.root.logger=INFO,CLA
>>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
>>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp
>>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
>>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1
>>> 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
>>> 235368760_0122_01_000002/stdout
>>> 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412
>>> 235368760_0122_01_000002/stderr
>>> yarn     27046 26964  1 16:11 ?        00:01:49
>>> /usr/local/sun-jdk/bin/java -Xmx819m -server
>>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc
>>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
>>> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true
>>> -Djava.awt.headless=true
>>> -Dlog4j.configuration=tez-container-log4j.properties
>>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223
>>> 5368760_0122/container_1412235368760_0122_01_000002
>>> -Dtez.root.logger=INFO,CLA
>>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141
>>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp
>>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298
>>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1
>>> [root@cmaster ~]# jps
>>> 28254 NodeManager
>>> 13597 -- process information unavailable
>>> 5757 -- process information unavailable
>>> 28723 start.jar
>>> 17192 HMaster
>>> 27261 JournalNode
>>> 27046 TezChild
>>> 6154 Jps
>>> 17043 QuorumPeerMain
>>> 27418 NameNode
>>> 28057 JobHistoryServer
>>> 27716 ResourceManager
>>> 27094 DataNode
>>> 
>>> Looks like there is one TezChild which is a child process from the
>>> NodeManager. We cannot get any thread dump from this one.
>>> And the 2nd TezChild is a child process from the first TezChild.
>>> 
>>> Find the thread dump of the 2nd TezChild at the bottom of this mail.
>>> Any idea whats going on here ?
>>> 
>>> best
>>> Johannes
>>> 
>>> -------------------------------------
>>> Attaching to process ID 27046, please wait...
>>> Debugger attached successfully.
>>> Server compiler detected.
>>> JVM version is 24.55-b03
>>> Deadlock Detection:
>>> 
>>> No deadlocks found.
>>> 
>>> Thread 4399: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
>>> may be imprecise)
>>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
>>> long) @bci=20, line=226 (Compiled frame)
>>> -
>>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util
>>> .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174,
>>> line=460 (Compiled frame)
>>> -
>>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj
>>> ect, boolean, long) @bci=102, line=359 (Compiled frame)
>>> - java.util.concurrent.SynchronousQueue.poll(long,
>>> java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068
>>> (Compiled frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27089: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>>> (Interpreted frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27088: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>>> (Interpreted frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27087: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>>> (Interpreted frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27086: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>>> (Interpreted frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27084: (state = BLOCKED)
>>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
>>> - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted
>>> frame)
>>> -
>>> org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac
>>> he) @bci=1, line=41 (Interpreted frame)
>>> - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted
>>> frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27082: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068
>>> (Interpreted frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27080: (state = BLOCKED)
>>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
>>> - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438
>>> (Interpreted frame)
>>> -
>>> org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas
>>> eRenewer, int) @bci=2, line=71 (Interpreted frame)
>>> - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27079: (state = BLOCKED)
>>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>>> imprecise)
>>> - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265,
>>> line=502 (Interpreted frame)
>>> 
>>> 
>>> Thread 27077: (state = IN_NATIVE)
>>> - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int,
>>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted
>>> frame)
>>> - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int,
>>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52
>>> (Interpreted frame)
>>> - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551,
>>> line=457 (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27076: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442
>>> (Interpreted frame)
>>> - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26,
>>> line=668 (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27071: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
>>> may be imprecise)
>>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
>>> long) @bci=20, line=226 (Compiled frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled
>>> frame)
>>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
>>> @bci=61, line=183 (Compiled frame)
>>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call()
>>> @bci=1, line=118 (Interpreted frame)
>>> - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted
>>> frame)
>>> -
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr
>>> eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
>>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
>>> (Interpreted frame)
>>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
>>> 
>>> 
>>> Thread 27065: (state = BLOCKED)
>>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>>> imprecise)
>>> - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902
>>> (Compiled frame)
>>> - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947
>>> (Compiled frame)
>>> 
>>> 
>>> Thread 27064: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> -
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai
>>> t() @bci=42, line=2043 (Interpreted frame)
>>> - some user code
>>> 
>>> 
>>> Thread 27063: (state = BLOCKED)
>>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>>> imprecise)
>>> - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame)
>>> - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame)
>>> 
>>> 
>>> Thread 27055: (state = BLOCKED)
>>> 
>>> 
>>> Thread 27054: (state = BLOCKED)
>>> - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
>>> - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled
>>> frame)
>>> - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame)
>>> - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189
>>> (Compiled frame)
>>> 
>>> 
>>> Thread 27053: (state = BLOCKED)
>>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
>>> imprecise)
>>> - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame)
>>> - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133
>>> (Interpreted frame)
>>> 
>>> 
>>> Thread 27047: (state = BLOCKED)
>>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
>>> line=186 (Interpreted frame)
>>> - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165,
>>> line=425 (Interpreted frame)
>>> - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted
>>> frame)
>>> - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96
>>> (Interpreted frame)
>>> - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218
>>> (Interpreted frame)
>>> - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116,
>>> line=433 (Interpreted frame)
>>> 
>>> -- 
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity to 
>>> which it is addressed and may contain information that is confidential, 
>>> privileged and exempt from disclosure under applicable law. If the reader 
>>> of this message is not the intended recipient, you are hereby notified that 
>>> any printing, copying, dissemination, distribution, disclosure or 
>>> forwarding of this communication is strictly prohibited. If you have 
>>> received this communication in error, please contact the sender immediately 
>>> and delete it from your system. Thank You.
>> 
> 

Reply via email to