Hi Johannes, To clarify, are you concerned about the 2 processes with pids 26964 and 27046 ? If yes, the first one is the bash script invoked by the YARN NodeManager to launch the actual TezChild JVM.
— HItesh On Oct 13, 2014, at 3:18 AM, Johannes Zillmann <[email protected]> wrote: > Hey Bikas, > > this was reproducible but the hang causes by some business logic in one of > our tasks waiting forever on something. > How the task jvm's were arranged looked strange but i not investigate further > since fixing our business logic fixed the issue! > > thanks > Johannes > > On 03 Oct 2014, at 02:39, Bikas Saha <[email protected]> wrote: > >> Not sure whats happening. Does this always repro? TezChild does not spawn >> another process. It runs tasks in threads inside its own process. Could be >> a cluster issue too since YARN is not reporting the containers correctly. >> >> Is this a mini cluster? It could be that the test execution shutdown the >> minicluster but left the AM and tasks orphaned. >> >> Bikas >> >> -----Original Message----- >> From: Johannes Zillmann [mailto:[email protected]] >> Sent: Thursday, October 02, 2014 10:28 AM >> To: [email protected] >> Subject: 2 processes for the same container both not registered in YARN UI >> >> Hey folks, >> >> encountering a strange problem. With a certain Tez job we end up with this >> situation: >> >> - AppMaster starts, launches a container (seen in logs) >> - NodeManager spawns the container (seen in logs) >> - nothing happens anymore, the AppMaster just sits an waits there forever >> - the YARN UI shows 0 running container >> - resource-manager and node-manager do not have any trace from the >> container except the it traversed to running state >> - 2 container JVMS are running: >> >> [root@cmaster ~]# ps -ef | grep TezChild >> root 6144 5547 0 18:14 pts/2 00:00:00 grep TezChild >> yarn 26964 28254 0 16:11 ? 00:00:00 /bin/bash -c >> /usr/local/sun-jdk/bin/java -Xmx819m -server >> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc >> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC >> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true >> -Djava.awt.headless=true >> -Dlog4j.configuration=tez-container-log4j.properties >> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 >> 5368760_0122/container_1412235368760_0122_01_000002 >> -Dtez.root.logger=INFO,CLA >> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 >> 2235368760_0122/container_1412235368760_0122_01_000002/tmp >> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 >> container_1412235368760_0122_01_000002 application_1412235368760_0122 1 >> 1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 >> 235368760_0122_01_000002/stdout >> 2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412 >> 235368760_0122_01_000002/stderr >> yarn 27046 26964 1 16:11 ? 00:01:49 >> /usr/local/sun-jdk/bin/java -Xmx819m -server >> -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc >> -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC >> -Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true >> -Djava.awt.headless=true >> -Dlog4j.configuration=tez-container-log4j.properties >> -Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_141223 >> 5368760_0122/container_1412235368760_0122_01_000002 >> -Dtez.root.logger=INFO,CLA >> -Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_141 >> 2235368760_0122/container_1412235368760_0122_01_000002/tmp >> org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 >> container_1412235368760_0122_01_000002 application_1412235368760_0122 1 >> [root@cmaster ~]# jps >> 28254 NodeManager >> 13597 -- process information unavailable >> 5757 -- process information unavailable >> 28723 start.jar >> 17192 HMaster >> 27261 JournalNode >> 27046 TezChild >> 6154 Jps >> 17043 QuorumPeerMain >> 27418 NameNode >> 28057 JobHistoryServer >> 27716 ResourceManager >> 27094 DataNode >> >> Looks like there is one TezChild which is a child process from the >> NodeManager. We cannot get any thread dump from this one. >> And the 2nd TezChild is a child process from the first TezChild. >> >> Find the thread dump of the 2nd TezChild at the bottom of this mail. >> Any idea whats going on here ? >> >> best >> Johannes >> >> ------------------------------------- >> Attaching to process ID 27046, please wait... >> Debugger attached successfully. >> Server compiler detected. >> JVM version is 24.55-b03 >> Deadlock Detection: >> >> No deadlocks found. >> >> Thread 4399: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information >> may be imprecise) >> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, >> long) @bci=20, line=226 (Compiled frame) >> - >> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util >> .concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=174, >> line=460 (Compiled frame) >> - >> java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Obj >> ect, boolean, long) @bci=102, line=359 (Compiled frame) >> - java.util.concurrent.SynchronousQueue.poll(long, >> java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 >> (Compiled frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27089: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >> (Interpreted frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27088: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >> (Interpreted frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27087: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >> (Interpreted frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27086: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >> (Interpreted frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27084: (state = BLOCKED) >> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) >> - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted >> frame) >> - >> org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCac >> he) @bci=1, line=41 (Interpreted frame) >> - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted >> frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27082: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 >> (Interpreted frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27080: (state = BLOCKED) >> - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) >> - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 >> (Interpreted frame) >> - >> org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.Leas >> eRenewer, int) @bci=2, line=71 (Interpreted frame) >> - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27079: (state = BLOCKED) >> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >> imprecise) >> - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, >> line=502 (Interpreted frame) >> >> >> Thread 27077: (state = IN_NATIVE) >> - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, >> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted >> frame) >> - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, >> org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 >> (Interpreted frame) >> - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, >> line=457 (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27076: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 >> (Interpreted frame) >> - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, >> line=668 (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27071: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information >> may be imprecise) >> - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, >> long) @bci=20, line=226 (Compiled frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t(long, java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled >> frame) >> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() >> @bci=61, line=183 (Compiled frame) >> - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() >> @bci=1, line=118 (Interpreted frame) >> - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted >> frame) >> - >> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.Thr >> eadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) >> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 >> (Interpreted frame) >> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) >> >> >> Thread 27065: (state = BLOCKED) >> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >> imprecise) >> - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 >> (Compiled frame) >> - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 >> (Compiled frame) >> >> >> Thread 27064: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >> t() @bci=42, line=2043 (Interpreted frame) >> - some user code >> >> >> Thread 27063: (state = BLOCKED) >> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >> imprecise) >> - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame) >> - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame) >> >> >> Thread 27055: (state = BLOCKED) >> >> >> Thread 27054: (state = BLOCKED) >> - java.lang.Object.wait(long) @bci=0 (Interpreted frame) >> - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled >> frame) >> - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame) >> - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 >> (Compiled frame) >> >> >> Thread 27053: (state = BLOCKED) >> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be >> imprecise) >> - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame) >> - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 >> (Interpreted frame) >> >> >> Thread 27047: (state = BLOCKED) >> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) >> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, >> line=186 (Interpreted frame) >> - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, >> line=425 (Interpreted frame) >> - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted >> frame) >> - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 >> (Interpreted frame) >> - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 >> (Interpreted frame) >> - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, >> line=433 (Interpreted frame) >> >> -- >> CONFIDENTIALITY NOTICE >> NOTICE: This message is intended for the use of the individual or entity to >> which it is addressed and may contain information that is confidential, >> privileged and exempt from disclosure under applicable law. If the reader >> of this message is not the intended recipient, you are hereby notified that >> any printing, copying, dissemination, distribution, disclosure or >> forwarding of this communication is strictly prohibited. If you have >> received this communication in error, please contact the sender immediately >> and delete it from your system. Thank You. >
