Hey Bikas, this was reproducible but the hang causes by some business logic in one of our tasks waiting forever on something. How the task jvm's were arranged looked strange but i not investigate further since fixing our business logic fixed the issue!
thanks Johannes On 03 Oct 2014, at 02:39, Bikas Saha <[email protected]> wrote: > Not sure whats happening. Does this always repro? TezChild does not spawn > another process. It runs tasks in threads inside its own process. Could be > a cluster issue too since YARN is not reporting the containers correctly. > > Is this a mini cluster? It could be that the test execution shutdown the > minicluster but left the AM and tasks orphaned. > > Bikas > > -----Original Message----- > From: Johannes Zillmann [mailto:[email protected]] > Sent: Thursday, October 02, 2014 10:28 AM > To: [email protected] > Subject: 2 processes for the same container both not registered in YARN UI > > Hey folks, > > encountering a strange problem. With a certain Tez job we end up with this > situation: > > - AppMaster starts, launches a container (seen in logs) > - NodeManager spawns the container (seen in logs) > - nothing happens anymore, the AppMaster just sits an waits there forever > - the YARN UI shows 0 running container > - resource-manager and node-manager do not have any trace from the > container except the it traversed to running state > - 2 container JVMS are running: > > [root@cmaster ~]# ps -ef | grep TezChild > root 6144 5547 0 18:14 pts/2 00:00:00 grep TezChild > yarn 26964 28254 0 16:11 ? 00:00:00 /bin/bash -c > /usr/local/sun-jdk/bin/java -Xmx819m -server > -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc > -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC > -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true > -Djava.awt.headless=true > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 > 5368760_0122/container_1412235368760_0122_01_000002 > -Dtez.root.logger=INFO,CLA > -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 > 2235368760_0122/container_1412235368760_0122_01_000002/tmp > org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 > container_1412235368760_0122_01_000002 application_1412235368760_0122 1 > 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 > 235368760_0122_01_000002/stdout > 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 > 235368760_0122_01_000002/stderr > yarn 27046 26964 1 16:11 ? 00:01:49 > /usr/local/sun-jdk/bin/java -Xmx819m -server > -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc > -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC > -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true > -Djava.awt.headless=true > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 > 5368760_0122/container_1412235368760_0122_01_000002 > -Dtez.root.logger=INFO,CLA > -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 > 2235368760_0122/container_1412235368760_0122_01_000002/tmp > org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 > container_1412235368760_0122_01_000002 application_1412235368760_0122 1 > [root@cmaster ~]# jps > 28254 NodeManager > 13597 -- process information unavailable > 5757 -- process information unavailable > 28723 start.jar > 17192 HMaster > 27261 JournalNode > 27046 TezChild > 6154 Jps > 17043 QuorumPeerMain > 27418 NameNode > 28057 JobHistoryServer > 27716 ResourceManager > 27094 DataNode > > Looks like there is one TezChild which is a child process from the > NodeManager. We cannot get any thread dump from this one. > And the 2nd TezChild is a child process from the first TezChild. > > Find the thread dump of the 2nd TezChild at the bottom of this mail. > Any idea whats going on here ? > > best > Johannes > > ------------------------------------- > Attaching to process ID 27046, please wait... > Debugger attached successfully. > Server compiler detected. > JVM version is 24.55-b03 > Deadlock Detection: > > No deadlocks found. > > Thread 4399: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, > long) @bci=20, line=226 (Compiled frame) > - > java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util > .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174, > line=460 (Compiled frame) > - > java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj > ect, boolean, long) @bci=102, line=359 (Compiled frame) > - java.util.concurrent.SynchronousQueue.poll(long, > java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 > (Compiled frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27089: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 > (Interpreted frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27088: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 > (Interpreted frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27087: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 > (Interpreted frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27086: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 > (Interpreted frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27084: (state = BLOCKED) > - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) > - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted > frame) > - > org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac > he) @bci=1, line=41 (Interpreted frame) > - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted > frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27082: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 > (Interpreted frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27080: (state = BLOCKED) > - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) > - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 > (Interpreted frame) > - > org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas > eRenewer, int) @bci=2, line=71 (Interpreted frame) > - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27079: (state = BLOCKED) > - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be > imprecise) > - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, > line=502 (Interpreted frame) > > > Thread 27077: (state = IN_NATIVE) > - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, > org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted > frame) > - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, > org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 > (Interpreted frame) > - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, > line=457 (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27076: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 > (Interpreted frame) > - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, > line=668 (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27071: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, > long) @bci=20, line=226 (Compiled frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled > frame) > - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() > @bci=61, line=183 (Compiled frame) > - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() > @bci=1, line=118 (Interpreted frame) > - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted > frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr > eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > > > Thread 27065: (state = BLOCKED) > - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be > imprecise) > - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 > (Compiled frame) > - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 > (Compiled frame) > > > Thread 27064: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > t() @bci=42, line=2043 (Interpreted frame) > - some user code > > > Thread 27063: (state = BLOCKED) > - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be > imprecise) > - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame) > - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame) > > > Thread 27055: (state = BLOCKED) > > > Thread 27054: (state = BLOCKED) > - java.lang.Object.wait(long) @bci=0 (Interpreted frame) > - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled > frame) > - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame) > - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 > (Compiled frame) > > > Thread 27053: (state = BLOCKED) > - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be > imprecise) > - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame) > - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 > (Interpreted frame) > > > Thread 27047: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=186 (Interpreted frame) > - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, > line=425 (Interpreted frame) > - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted > frame) > - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 > (Interpreted frame) > - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 > (Interpreted frame) > - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, > line=433 (Interpreted frame) > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You.
