Or maybe I can't use attachments, so here's the stack traces inline:
--------------------------task tracker----------------------------
2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed
mode):
"Attach Listener" daemon prio=10 tid=0x00002aab1205c400 nid=0x523d
waiting on condition [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"IPC Client connection to
bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10
tid=0x00002aab14317000 nid=0x5230 in Object.wait()
[0x0000000041c44000..0x0000000041c44ba0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaaf304da08> (a
org.apache.hadoop.ipc.Client$Connection)
at java.lang.Object.wait(Object.java:485)
at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
- locked <0x00002aaaf304da08> (a
org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
"process reaper" daemon prio=10 tid=0x00002aab1205bc00 nid=0x51c6
runnable [0x0000000041f47000..0x0000000041f47da0]
java.lang.Thread.State: RUNNABLE
at java.lang.UNIXProcess.waitForProcessExit(Native Method)
at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
"Thread-408" prio=10 tid=0x00002aab14316000 nid=0x51c5 in
Object.wait() [0x0000000041d45000..0x0000000041d45a20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
at java.lang.Object.wait(Object.java:485)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
- locked <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
at org.apache.hadoop.util.Shell.run(Shell.java:100)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)
"SocketListener0-0" prio=10 tid=0x00002aab1205e400 nid=0x519d in
Object.wait() [0x0000000041038000..0x0000000041038da0]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaaf2650c20> (a
org.mortbay.util.ThreadPool$PoolThread)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
- locked <0x00002aaaf2650c20> (a
org.mortbay.util.ThreadPool$PoolThread)
"[EMAIL PROTECTED]" daemon prio=10
tid=0x00002aab183a9000 nid=0x46f5 waiting on condition
[0x0000000041a42000..0x0000000041a42aa0]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
at java.lang.Thread.run(Thread.java:619)
"[EMAIL PROTECTED]" daemon prio=10
tid=0x00002aab183ce000 nid=0x46ef waiting on condition
[0x0000000041840000..0x0000000041840c20]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
at java.lang.Thread.run(Thread.java:619)
"Map-events fetcher for all reduce tasks on
tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon
prio=10 tid=0x00002aab18438400 nid=0x4631 in Object.wait()
[0x000000004173f000..0x000000004173fda0]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f0ace0> (a java.lang.Object)
at
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534)
- locked <0x00002aaab3f0ace0> (a java.lang.Object)
"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x00002aab18427400 nid=0x462f waiting on condition
[0x000000004153d000..0x000000004153daa0]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)
"IPC Server handler 1 on 43477" daemon prio=10 tid=0x00002aab18476c00
nid=0x462e in Object.wait() [0x000000004143c000..0x000000004143cb20]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
- locked <0x00002aaab41356b0> (a java.util.LinkedList)
"IPC Server handler 0 on 43477" daemon prio=10 tid=0x00002aab18389c00
nid=0x462d in Object.wait() [0x000000004133b000..0x000000004133bba0]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
- locked <0x00002aaab41356b0> (a java.util.LinkedList)
"IPC Server listener on 43477" daemon prio=10 tid=0x00002aab183ddc00
nid=0x462c runnable [0x000000004123a000..0x000000004123ac20]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00002aaab4136b80> (a sun.nio.ch.Util$1)
- locked <0x00002aaab4136b68> (a
java.util.Collections$UnmodifiableSet)
- locked <0x00002aaab41328c8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
at org.apache.hadoop.ipc.Server$Listener.run(Server.java:308)
"IPC Server Responder" daemon prio=10 tid=0x00002aab183dd400
nid=0x462b runnable [0x0000000041139000..0x0000000041139ca0]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00002aaab41362d0> (a sun.nio.ch.Util$1)
- locked <0x00002aaab41362b8> (a
java.util.Collections$UnmodifiableSet)
- locked <0x00002aaab4135f10> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.hadoop.ipc.Server$Responder.run(Server.java:450)
"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]"
prio=10 tid=0x00002aab1838b400 nid=0x4629 runnable
[0x0000000040f37000..0x0000000040f37da0]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
- locked <0x00002aaab4136e78> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:453)
at java.net.ServerSocket.accept(ServerSocket.java:421)
at
org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432)
at
org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)
"SessionScavenger" daemon prio=10 tid=0x00002aab183c7400 nid=0x4628
waiting on condition [0x0000000040e36000..0x0000000040e36a20]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587)
"taskCleanup" daemon prio=10 tid=0x00002aab18383000 nid=0x4623 waiting
on condition [0x0000000040d35000..0x0000000040d35aa0]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00002aaab3f0ac10> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:300)
at java.lang.Thread.run(Thread.java:619)
"Low Memory Detector" daemon prio=10 tid=0x00002aab122b7000 nid=0x461f
runnable [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"CompilerThread1" daemon prio=10 tid=0x00002aab122b4800 nid=0x461e
waiting on condition [0x0000000000000000..0x0000000040a31c10]
java.lang.Thread.State: RUNNABLE
"CompilerThread0" daemon prio=10 tid=0x00002aab122b2c00 nid=0x461d
waiting on condition [0x0000000000000000..0x0000000040930c90]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00002aab1209e800 nid=0x461c
runnable [0x0000000000000000..0x00000000408309b0]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00002aab12089400 nid=0x461b in
Object.wait() [0x000000004072f000..0x000000004072fda0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e6b448> (a
java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x00002aab12088400 nid=0x461a
in Object.wait() [0x000000004062e000..0x000000004062ea20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x0000000040117000 nid=0x4616 in Object.wait()
[0x000000004022a000..0x000000004022aee0]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f09ff8> (a [I)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:898)
- locked <0x00002aaab3f09ff8> (a [I)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
"VM Thread" prio=10 tid=0x00002aab12083800 nid=0x4619 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121400
nid=0x4617 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040122800
nid=0x4618 runnable
"VM Periodic Task Thread" prio=10 tid=0x00002aab122b8c00 nid=0x4620
waiting on condition
JNI global references: 831
----------------------------------------task
child--------------------------------------------------
2008-03-17 21:58:45
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed
mode):
"Attach Listener" daemon prio=10 tid=0x00002aaad440f000 nid=0x5254
runnable [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"DestroyJavaVM" prio=10 tid=0x00002aaad8045800 nid=0x51c9 waiting on
condition [0x0000000000000000..0x000000004032ad10]
java.lang.Thread.State: RUNNABLE
"Thread-12" prio=10 tid=0x00002aaad8046400 nid=0x51fc in Object.wait()
[0x0000000042940000..0x0000000042940da0]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaac900c9f0> (a java.util.LinkedList)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1577)
- locked <0x00002aaac900c9f0> (a java.util.LinkedList)
"Comm thread for task_200803172138_0005_r_000003_0" daemon prio=10
tid=0x00002aaad8032400 nid=0x51e3 waiting on condition
[0x0000000041b39000..0x0000000041b39d20]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
at java.lang.Thread.run(Thread.java:619)
"[EMAIL PROTECTED]" daemon prio=10
tid=0x00002aaad802e800 nid=0x51db waiting on condition
[0x0000000041938000..0x0000000041938ca0]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
at java.lang.Thread.run(Thread.java:619)
"IPC Client connection to
bigmike.internal.persai.com/192.168.1.3:9000" daemon prio=10
tid=0x00002aaad8024000 nid=0x51da in Object.wait()
[0x0000000041737000..0x0000000041737c20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f9d468> (a
org.apache.hadoop.ipc.Client$Connection)
at java.lang.Object.wait(Object.java:485)
at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
- locked <0x00002aaab3f9d468> (a
org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
"IPC Client connection to /127.0.0.1:43477" daemon prio=10
tid=0x00002aaad800b400 nid=0x51d9 in Object.wait()
[0x0000000041536000..0x0000000041536ba0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f9dc88> (a
org.apache.hadoop.ipc.Client$Connection)
at java.lang.Object.wait(Object.java:485)
at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
- locked <0x00002aaab3f9dc88> (a
org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x00002aaad8004000 nid=0x51d8 waiting on condition
[0x0000000041335000..0x0000000041335b20]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)
"Low Memory Detector" daemon prio=10 tid=0x00002aaad3776400 nid=0x51d2
runnable [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"CompilerThread1" daemon prio=10 tid=0x00002aaad3773c00 nid=0x51d1
waiting on condition [0x0000000000000000..0x0000000040e31d00]
java.lang.Thread.State: RUNNABLE
"CompilerThread0" daemon prio=10 tid=0x00002aaad3772000 nid=0x51d0
waiting on condition [0x0000000000000000..0x0000000040d30c90]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00002aaad3770c00 nid=0x51cf
runnable [0x0000000000000000..0x0000000040c30930]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00002aaad3548800 nid=0x51ce in
Object.wait() [0x0000000040a2f000..0x0000000040a2fc20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e63330> (a
java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x00002aaad3547c00 nid=0x51cd
in Object.wait() [0x000000004082e000..0x000000004082eba0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
"VM Thread" prio=10 tid=0x00002aaad3543000 nid=0x51cc runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121c00
nid=0x51ca runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040123000
nid=0x51cb runnable
"VM Periodic Task Thread" prio=10 tid=0x00002aaad3778400 nid=0x51d3
waiting on condition
JNI global references: 764
-------- Original Message --------
Subject: Re: runtime exceptions not killing job
Date: Mon, 17 Mar 2008 22:01:44 -0700
From: Matt Kent <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
References: <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]>
It seems to happen only with reduce tasks, not map tasks. I reproduced
it by having a dummy reduce task throw an NPE immediately. The error
is shown on the reduce details page but the job does not register the
task as failed. I've attached the task tracker stack trace, the child
stack trace and a screenshot of the task list page.
Matt
Owen O'Malley wrote:
On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:
I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in
0.14.1, if a map or reduce task threw a runtime exception such as an
NPE, the task, and ultimately the job, would fail in short order. I
was running on job on my local 0.16.1 cluster today, and when the
reduce tasks started throwing NPEs, the tasks just hung. Eventually
they timed out and were killed, but is this expected behavior in
0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.
This sounds like a bug. Tasks should certainly fail immediately if an
exception is thrown. Do you know where the exception is being thrown?
Can you get a stack trace of the task from jstack after the exception
and before the task times out?
Thanks,
Owen