Thanks Matt for info.
I raised a Jira for this at https://issues.apache.org/jira/browse/HADOOP-3039

Thanks
Amareshwari
Matt Kent wrote:
Or maybe I can't use attachments, so here's the stack traces inline:

--------------------------task tracker----------------------------

2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00002aab1205c400 nid=0x523d waiting on condition [0x0000000000000000..0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10 tid=0x00002aab14317000 nid=0x5230 in Object.wait() [0x0000000041c44000..0x0000000041c44ba0]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
   at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234) - locked <0x00002aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"process reaper" daemon prio=10 tid=0x00002aab1205bc00 nid=0x51c6 runnable [0x0000000041f47000..0x0000000041f47da0]
  java.lang.Thread.State: RUNNABLE
   at java.lang.UNIXProcess.waitForProcessExit(Native Method)
   at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
   at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

"Thread-408" prio=10 tid=0x00002aab14316000 nid=0x51c5 in Object.wait() [0x0000000041d45000..0x0000000041d45a20]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
   at java.lang.Object.wait(Object.java:485)
   at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
   - locked <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
   at org.apache.hadoop.util.Shell.run(Shell.java:100)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)

"SocketListener0-0" prio=10 tid=0x00002aab1205e400 nid=0x519d in Object.wait() [0x0000000041038000..0x0000000041038da0]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
- locked <0x00002aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x00002aab183a9000 nid=0x46f5 waiting on condition [0x0000000041a42000..0x0000000041a42aa0]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
   at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x00002aab183ce000 nid=0x46ef waiting on condition [0x0000000041840000..0x0000000041840c20]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
   at java.lang.Thread.run(Thread.java:619)

"Map-events fetcher for all reduce tasks on tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon prio=10 tid=0x00002aab18438400 nid=0x4631 in Object.wait() [0x000000004173f000..0x000000004173fda0]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab3f0ace0> (a java.lang.Object)
at org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534)
   - locked <0x00002aaab3f0ace0> (a java.lang.Object)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x00002aab18427400 nid=0x462f waiting on condition [0x000000004153d000..0x000000004153daa0]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"IPC Server handler 1 on 43477" daemon prio=10 tid=0x00002aab18476c00 nid=0x462e in Object.wait() [0x000000004143c000..0x000000004143cb20]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
   - locked <0x00002aaab41356b0> (a java.util.LinkedList)

"IPC Server handler 0 on 43477" daemon prio=10 tid=0x00002aab18389c00 nid=0x462d in Object.wait() [0x000000004133b000..0x000000004133bba0]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
   - locked <0x00002aaab41356b0> (a java.util.LinkedList)

"IPC Server listener on 43477" daemon prio=10 tid=0x00002aab183ddc00 nid=0x462c runnable [0x000000004123a000..0x000000004123ac20]
  java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   - locked <0x00002aaab4136b80> (a sun.nio.ch.Util$1)
- locked <0x00002aaab4136b68> (a java.util.Collections$UnmodifiableSet)
   - locked <0x00002aaab41328c8> (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
   at org.apache.hadoop.ipc.Server$Listener.run(Server.java:308)

"IPC Server Responder" daemon prio=10 tid=0x00002aab183dd400 nid=0x462b runnable [0x0000000041139000..0x0000000041139ca0]
  java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   - locked <0x00002aaab41362d0> (a sun.nio.ch.Util$1)
- locked <0x00002aaab41362b8> (a java.util.Collections$UnmodifiableSet)
   - locked <0x00002aaab4135f10> (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   at org.apache.hadoop.ipc.Server$Responder.run(Server.java:450)

"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]" prio=10 tid=0x00002aab1838b400 nid=0x4629 runnable [0x0000000040f37000..0x0000000040f37da0]
  java.lang.Thread.State: RUNNABLE
   at java.net.PlainSocketImpl.socketAccept(Native Method)
   at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
   - locked <0x00002aaab4136e78> (a java.net.SocksSocketImpl)
   at java.net.ServerSocket.implAccept(ServerSocket.java:453)
   at java.net.ServerSocket.accept(ServerSocket.java:421)
at org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432) at org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)

"SessionScavenger" daemon prio=10 tid=0x00002aab183c7400 nid=0x4628 waiting on condition [0x0000000040e36000..0x0000000040e36a20]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587)

"taskCleanup" daemon prio=10 tid=0x00002aab18383000 nid=0x4623 waiting on condition [0x0000000040d35000..0x0000000040d35aa0]
  java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00002aaab3f0ac10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
   at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:300)
   at java.lang.Thread.run(Thread.java:619)

"Low Memory Detector" daemon prio=10 tid=0x00002aab122b7000 nid=0x461f runnable [0x0000000000000000..0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aab122b4800 nid=0x461e waiting on condition [0x0000000000000000..0x0000000040a31c10]
  java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aab122b2c00 nid=0x461d waiting on condition [0x0000000000000000..0x0000000040930c90]
  java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aab1209e800 nid=0x461c runnable [0x0000000000000000..0x00000000408309b0]
  java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aab12089400 nid=0x461b in Object.wait() [0x000000004072f000..0x000000004072fda0]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
   - locked <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
   at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aab12088400 nid=0x461a in Object.wait() [0x000000004062e000..0x000000004062ea20]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
   at java.lang.Object.wait(Object.java:485)
   at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
   - locked <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000040117000 nid=0x4616 in Object.wait() [0x000000004022a000..0x000000004022aee0]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab3f09ff8> (a [I)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:898)
   - locked <0x00002aaab3f09ff8> (a [I)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)

"VM Thread" prio=10 tid=0x00002aab12083800 nid=0x4619 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121400 nid=0x4617 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040122800 nid=0x4618 runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aab122b8c00 nid=0x4620 waiting on condition

JNI global references: 831


----------------------------------------task child--------------------------------------------------

2008-03-17 21:58:45
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00002aaad440f000 nid=0x5254 runnable [0x0000000000000000..0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00002aaad8045800 nid=0x51c9 waiting on condition [0x0000000000000000..0x000000004032ad10]
  java.lang.Thread.State: RUNNABLE

"Thread-12" prio=10 tid=0x00002aaad8046400 nid=0x51fc in Object.wait() [0x0000000042940000..0x0000000042940da0]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaac900c9f0> (a java.util.LinkedList)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1577)
   - locked <0x00002aaac900c9f0> (a java.util.LinkedList)

"Comm thread for task_200803172138_0005_r_000003_0" daemon prio=10 tid=0x00002aaad8032400 nid=0x51e3 waiting on condition [0x0000000041b39000..0x0000000041b39d20]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
   at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x00002aaad802e800 nid=0x51db waiting on condition [0x0000000041938000..0x0000000041938ca0]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
   at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9000" daemon prio=10 tid=0x00002aaad8024000 nid=0x51da in Object.wait() [0x0000000041737000..0x0000000041737c20]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f9d468> (a org.apache.hadoop.ipc.Client$Connection)
   at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234) - locked <0x00002aaab3f9d468> (a org.apache.hadoop.ipc.Client$Connection)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"IPC Client connection to /127.0.0.1:43477" daemon prio=10 tid=0x00002aaad800b400 nid=0x51d9 in Object.wait() [0x0000000041536000..0x0000000041536ba0]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3f9dc88> (a org.apache.hadoop.ipc.Client$Connection)
   at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234) - locked <0x00002aaab3f9dc88> (a org.apache.hadoop.ipc.Client$Connection)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x00002aaad8004000 nid=0x51d8 waiting on condition [0x0000000041335000..0x0000000041335b20]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"Low Memory Detector" daemon prio=10 tid=0x00002aaad3776400 nid=0x51d2 runnable [0x0000000000000000..0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aaad3773c00 nid=0x51d1 waiting on condition [0x0000000000000000..0x0000000040e31d00]
  java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aaad3772000 nid=0x51d0 waiting on condition [0x0000000000000000..0x0000000040d30c90]
  java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aaad3770c00 nid=0x51cf runnable [0x0000000000000000..0x0000000040c30930]
  java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aaad3548800 nid=0x51ce in Object.wait() [0x0000000040a2f000..0x0000000040a2fc20]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
   - locked <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
   at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aaad3547c00 nid=0x51cd in Object.wait() [0x000000004082e000..0x000000004082eba0]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
   at java.lang.Object.wait(Object.java:485)
   at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
   - locked <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00002aaad3543000 nid=0x51cc runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121c00 nid=0x51ca runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040123000 nid=0x51cb runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aaad3778400 nid=0x51d3 waiting on condition

JNI global references: 764


-------- Original Message --------
Subject:     Re: runtime exceptions not killing job
Date:     Mon, 17 Mar 2008 22:01:44 -0700
From:     Matt Kent <[EMAIL PROTECTED]>
To:     core-user@hadoop.apache.org
References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>



It seems to happen only with reduce tasks, not map tasks. I reproduced it by having a dummy reduce task throw an NPE immediately. The error is shown on the reduce details page but the job does not register the task as failed. I've attached the task tracker stack trace, the child stack trace and a screenshot of the task list page.

Matt

Owen O'Malley wrote:

On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:

I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1, if a map or reduce task threw a runtime exception such as an NPE, the task, and ultimately the job, would fail in short order. I was running on job on my local 0.16.1 cluster today, and when the reduce tasks started throwing NPEs, the tasks just hung. Eventually they timed out and were killed, but is this expected behavior in 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.

This sounds like a bug. Tasks should certainly fail immediately if an exception is thrown. Do you know where the exception is being thrown? Can you get a stack trace of the task from jstack after the exception and before the task times out?

Thanks,
   Owen




Reply via email to