Ok, thanks, thats good to know! Johannes
On 13 Oct 2014, at 20:18, Hitesh Shah <[email protected]> wrote: > Hi Johannes, > > To clarify, are you concerned about the 2 processes with pids 26964 and > 27046 ? If yes, the first one is the bash script invoked by the YARN > NodeManager to launch the actual TezChild JVM. > > — HItesh > > On Oct 13, 2014, at 3:18 AM, Johannes Zillmann <[email protected]> > wrote: > >> Hey Bikas, >> >> this was reproducible but the hang causes by some business logic in one of >> our tasks waiting forever on something. >> How the task jvm's were arranged looked strange but i not investigate >> further since fixing our business logic fixed the issue! >> >> thanks >> Johannes >> >> On 03 Oct 2014, at 02:39, Bikas Saha <[email protected]> wrote: >> >>> Not sure whats happening. Does this always repro? TezChild does not spawn >>> another process. It runs tasks in threads inside its own process. Could be >>> a cluster issue too since YARN is not reporting the containers correctly. >>> >>> Is this a mini cluster? It could be that the test execution shutdown the >>> minicluster but left the AM and tasks orphaned. >>> >>> Bikas >>> >>> -----Original Message----- >>> From: Johannes Zillmann [mailto:[email protected]] >>> Sent: Thursday, October 02, 2014 10:28 AM >>> To: [email protected] >>> Subject: 2 processes for the same container both not registered in YARN UI >>> >>> Hey folks, >>> >>> encountering a strange problem. With a certain Tez job we end up with this >>> situation: >>> >>> - AppMaster starts, launches a container (seen in logs) >>> - NodeManager spawns the container (seen in logs) >>> - nothing happens anymore, the AppMaster just sits an waits there forever >>> - the YARN UI shows 0 running container >>> - resource-manager and node-manager do not have any trace from the >>> container except the it traversed to running state >>> - 2 container JVMS are running: >>> >>> [root@cmaster ~]# ps -ef | grep TezChild >>> root 6144 5547 0 18:14 pts/2 00:00:00 grep TezChild >>> yarn 26964 28254 0 16:11 ? 00:00:00 /bin/bash -c >>> /usr/local/sun-jdk/bin/java -Xmx819m -server >>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc >>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC >>> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true >>> -Djava.awt.headless=true >>> -Dlog4j.configuration=tez-container-log4j.properties >>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 >>> 5368760_0122/container_1412235368760_0122_01_000002 >>> -Dtez.root.logger=INFO,CLA >>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 >>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp >>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 >>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1 >>> 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 >>> 235368760_0122_01_000002/stdout >>> 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 >>> 235368760_0122_01_000002/stderr >>> yarn 27046 26964 1 16:11 ? 00:01:49 >>> /usr/local/sun-jdk/bin/java -Xmx819m -server >>> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc >>> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC >>> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true >>> -Djava.awt.headless=true >>> -Dlog4j.configuration=tez-container-log4j.properties >>> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 >>> 5368760_0122/container_1412235368760_0122_01_000002 >>> -Dtez.root.logger=INFO,CLA >>> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 >>> 2235368760_0122/container_1412235368760_0122_01_000002/tmp >>> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 >>> container_1412235368760_0122_01_000002 application_1412235368760_0122 1 >>> [root@cmaster ~]# jps >>> 28254 NodeManager >>> 13597 -- process information unavailable >>> 5757 -- process information unavailable >>> 28723 start.jar >>> 17192 HMaster >>> 27261 JournalNode >>> 27046 TezChild >>> 6154 Jps >>> 17043 QuorumPeerMain >>> 27418 NameNode >>> 28057 JobHistoryServer >>> 27716 ResourceManager >>> 27094 DataNode >>> >>> Looks like there is one TezChild which is a child process from the >>> NodeManager. We cannot get any thread dump from this one. >>> And the 2nd TezChild is a child process from the first TezChild. >>> >>> Find the thread dump of the 2nd TezChild at the bottom of this mail. >>> Any idea whats going on here ? >>> >>> best >>> Johannes >>> >>> ------------------------------------- >>> Attaching to process ID 27046, please wait... >>> Debugger attached successfully. >>> Server compiler detected. >>> JVM version is 24.55-b03 >>> Deadlock Detection: >>> >>> No deadlocks found. >>> >>> Thread 4399: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information >>> may be imprecise) >>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, >>> long) @bci=20, line=226 (Compiled frame) >>> - >>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util >>> .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174, >>> line=460 (Compiled frame) >>> - >>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj >>> ect, boolean, long) @bci=102, line=359 (Compiled frame) >>> - java.util.concurrent.SynchronousQueue.poll(long, >>> java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 >>> (Compiled frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27089: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >>> (Interpreted frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27088: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >>> (Interpreted frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27087: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >>> (Interpreted frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27086: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >>> (Interpreted frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27084: (state = BLOCKED) >>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) >>> - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted >>> frame) >>> - >>> org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac >>> he) @bci=1, line=41 (Interpreted frame) >>> - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted >>> frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27082: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >>> (Interpreted frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27080: (state = BLOCKED) >>> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) >>> - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 >>> (Interpreted frame) >>> - >>> org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas >>> eRenewer, int) @bci=2, line=71 (Interpreted frame) >>> - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27079: (state = BLOCKED) >>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >>> imprecise) >>> - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, >>> line=502 (Interpreted frame) >>> >>> >>> Thread 27077: (state = IN_NATIVE) >>> - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, >>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted >>> frame) >>> - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, >>> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 >>> (Interpreted frame) >>> - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, >>> line=457 (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27076: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >>> (Interpreted frame) >>> - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, >>> line=668 (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27071: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information >>> may be imprecise) >>> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, >>> long) @bci=20, line=226 (Compiled frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled >>> frame) >>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() >>> @bci=61, line=183 (Compiled frame) >>> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() >>> @bci=1, line=118 (Interpreted frame) >>> - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted >>> frame) >>> - >>> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >>> eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) >>> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >>> (Interpreted frame) >>> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >>> >>> >>> Thread 27065: (state = BLOCKED) >>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >>> imprecise) >>> - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 >>> (Compiled frame) >>> - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 >>> (Compiled frame) >>> >>> >>> Thread 27064: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >>> t() @bci=42, line=2043 (Interpreted frame) >>> - some user code >>> >>> >>> Thread 27063: (state = BLOCKED) >>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >>> imprecise) >>> - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame) >>> - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame) >>> >>> >>> Thread 27055: (state = BLOCKED) >>> >>> >>> Thread 27054: (state = BLOCKED) >>> - java.lang.Object.wait(long) @bci=0 (Interpreted frame) >>> - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled >>> frame) >>> - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame) >>> - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 >>> (Compiled frame) >>> >>> >>> Thread 27053: (state = BLOCKED) >>> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >>> imprecise) >>> - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame) >>> - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 >>> (Interpreted frame) >>> >>> >>> Thread 27047: (state = BLOCKED) >>> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >>> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >>> line=186 (Interpreted frame) >>> - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, >>> line=425 (Interpreted frame) >>> - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted >>> frame) >>> - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 >>> (Interpreted frame) >>> - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 >>> (Interpreted frame) >>> - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, >>> line=433 (Interpreted frame) >>> >>> -- >>> CONFIDENTIALITY NOTICE >>> NOTICE: This message is intended for the use of the individual or entity to >>> which it is addressed and may contain information that is confidential, >>> privileged and exempt from disclosure under applicable law. If the reader >>> of this message is not the intended recipient, you are hereby notified that >>> any printing, copying, dissemination, distribution, disclosure or >>> forwarding of this communication is strictly prohibited. If you have >>> received this communication in error, please contact the sender immediately >>> and delete it from your system. Thank You. >> >
