Re: TaskTrackers disengaging from JobTracker
To complete the picture: not only was our network swamped, I realized tonight that the NameNode/JobTracker was running on a 99% full disk (it hit 100% full about thirty minutes ago). That poor JobTracker was fighting against a lot of odds. As soon as we upgrade to a bigger disk and switch it back on, I'll apply the supplied patch to the cluster. Thank you for looking into this!

- Aaron

On Thu, Oct 30, 2008 at 3:42 PM, Raghu Angadi [EMAIL PROTECTED] wrote:
> Raghu Angadi wrote:
>> Devaraj fwded the stacks that Aaron sent. As he suspected, there is a
>> deadlock in the RPC server. I will file a blocker for 0.18 and above.
>> This deadlock is more likely on a busy network.
>
> Aaron,
>
> Could you try the patch attached to
> https://issues.apache.org/jira/browse/HADOOP-4552 ?
>
> Thanks,
> Raghu.
Re: TaskTrackers disengaging from JobTracker
Aaron,

Does this have to do with the cloud we got at SoftLayer, or is it something at the university like you said? I think I provisioned 6 - 10 with 1G connections; let me log in and check really quick... no, they're all correct: Speed: 1000Mb/s

On Wed, Oct 29, 2008 at 9:49 PM, Aaron Kimball [EMAIL PROTECTED] wrote:
> Just as I wrote that, Murphy's law struck :) This did not fix the issue
> after all. I think the problem is occurring because a huge amount of
> network bandwidth is being consumed by the jobs. What settings (timeouts,
> thread counts, etc.), if any, ought I dial up to correct for this?
>
> Thanks,
> - Aaron
>
> Aaron Kimball wrote:
>> It's a cluster being used for a university course; there are 30 students
>> all running code which (to be polite) probably tests the limits of
>> Hadoop's failure recovery logic. :) The current assignment is PageRank
>> over Wikipedia, a 20 GB input corpus. Individual jobs run ~5--15 minutes
>> in length, using 300 map tasks and 50 reduce tasks.
>>
>> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
>> it against TRUNK. I've put this on the cluster and it's now been holding
>> steady for the last hour or so... so that plus whatever other differences
>> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
>> patch to the JIRA as soon as it finishes cranking against the JUnit
>> tests.)
>>
>> - Aaron
>>
>> Devaraj Das wrote:
>>> [... Devaraj's reply, the killJob() NullPointerException trace, Arun's
>>> message, the original report, and the TaskTracker heartbeat stack
>>> trace, all quoted in full in the messages below; truncated here in the
>>> archive ...]
Re: TaskTrackers disengaging from JobTracker
Devaraj Das wrote:
>> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
>> it against TRUNK. I've put this on the cluster and it's now been holding
>> steady for the last hour or so... so that plus whatever other differences
>> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
>> patch to the JIRA as soon as it finishes cranking against the JUnit
>> tests.)
>
> Aaron, I don't think this is a solution to the problem you are seeing.
> The IPC handlers are tolerant to exceptions. In particular, they must not
> die in the event of an exception during RPC processing. Could you please
> get a stack trace of the JobTracker threads (without your patch) when the
> TTs are unable to talk to it? Access the URL
> http://jt-host:jt-info-port/stacks -- that will tell us what the handlers
> are up to.

Devaraj fwded the stacks that Aaron sent. As he suspected, there is a deadlock in the RPC server. I will file a blocker for 0.18 and above. This deadlock is more likely on a busy network.

Raghu.
Re: TaskTrackers disengaging from JobTracker
Raghu Angadi wrote:
> Devaraj fwded the stacks that Aaron sent. As he suspected, there is a
> deadlock in the RPC server. I will file a blocker for 0.18 and above.
> This deadlock is more likely on a busy network.

Aaron,

Could you try the patch attached to https://issues.apache.org/jira/browse/HADOOP-4552 ?

Thanks,
Raghu.
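For readers of the archive unfamiliar with the failure mode Raghu describes: a deadlock in the RPC server leaves the handler threads blocked forever, so the JVM stays up but heartbeats are never answered -- exactly the "daemons still running, 0 nodes available" symptom reported below. The snippet that follows is a generic illustration of a lock-ordering deadlock, not the actual HADOOP-4552 code (see the JIRA for that); every name in it is invented for the example. A busy network widens the race window because threads sit in the critical sections longer while socket writes block.

    // Generic two-lock ordering deadlock -- illustration only, not Hadoop code.
    public class LockOrderDeadlock {
        private final Object callLock = new Object();      // invented name
        private final Object responseLock = new Object();  // invented name

        void handlerPath() {                 // e.g. an RPC handler thread
            synchronized (callLock) {
                synchronized (responseLock) {    // blocks if a responder holds it
                    // ... queue a response ...
                }
            }
        }

        void responderPath() {               // e.g. a response-writer thread
            synchronized (responseLock) {
                synchronized (callLock) {        // blocks if a handler holds it
                    // ... drain pending responses ...
                }
            }
        }
    }

Once one thread is inside handlerPath() and another is inside responderPath() at the same time, each holds the lock the other needs and neither can ever proceed.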
TaskTrackers disengaging from JobTracker
Hi all,

I'm working with a 30 node Hadoop cluster that has just started demonstrating some weird behavior. It's run without incident for a few weeks... and now: the cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine, but the JobTracker will say 0 nodes available on the web status screen. Restarting MapReduce fixes this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1. I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
        at java.lang.Throwable.<init>(Throwable.java:67)
        at org.apache.hadoop.ipc.Client.call(Client.java:718)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
        at sun.nio.ch.IOUtil.read(IOUtil.java:207)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.FilterInputStream.read(FilterInputStream.java:127)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
        at java.io.DataInputStream.readInt(DataInputStream.java:381)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

As well as a few of these warnings:

2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060

The NameNode and DataNodes are completely fine. Can't be a DNS issue, because all DNS is served through /etc/hosts files. The NameNode and JobTracker are on the same machine. Any help is appreciated.

Thanks,
- Aaron Kimball
Re: TaskTrackers disengaging from JobTracker
It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say?

Arun

On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
> Hi all,
>
> I'm working with a 30 node Hadoop cluster that has just started
> demonstrating some weird behavior. It's run without incident for a few
> weeks... and now: the cluster will run smoothly for 90--120 minutes or
> so, handling jobs continually during this time. Then suddenly all 29
> TaskTrackers will get disconnected from the JobTracker. All the tracker
> daemon processes are still running on each machine, but the JobTracker
> will say 0 nodes available on the web status screen. Restarting MapReduce
> fixes this for another 90--120 minutes.
>
> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
> but we're running on 0.18.1.
>
> [... TaskTracker heartbeat stack trace and Jetty thread warnings, quoted
> in full in the original message above ...]
>
> The NameNode and DataNodes are completely fine. Can't be a DNS issue,
> because all DNS is served through /etc/hosts files. The NameNode and
> JobTracker are on the same machine. Any help is appreciated.
>
> Thanks,
> - Aaron Kimball
Re: TaskTrackers disengaging from JobTracker
The system load and memory consumption on the JT are both very close to idle states -- it's not overworked, I don't think. I may have an idea of the problem, though. Digging back up a ways into the JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:599)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

This exception is then repeated for all the IPC server handlers. So I think the problem is that all the handler threads are dying one by one due to this NPE. Is this something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
> It's possible that the JobTracker is under duress and unable to respond
> to the TaskTrackers... what do the JobTracker logs say?
>
> Arun
>
> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>> [... original report, TaskTracker heartbeat stack trace, and Jetty
>> thread warnings, quoted in full above ...]
Re: TaskTrackers disengaging from JobTracker
Could the version of Java being used matter? I just realized this cluster runs IBM Java, not Sun:

java version "1.6.0"
Java(TM) SE Runtime Environment (build pxi3260sr2-20080818_01(SR2))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux x86-32 jvmxi3260-20080816_22093 (JIT enabled, AOT enabled)
J9VM - 20080816_022093_lHdSMr
JIT  - r9_20080721_1330ifx2
GC   - 20080724_AA)
JCL  - 20080808_02

- Aaron

On Wed, Oct 29, 2008 at 2:43 PM, Aaron Kimball [EMAIL PROTECTED] wrote:
> The system load and memory consumption on the JT are both very close to
> idle states -- it's not overworked, I don't think. I may have an idea of
> the problem, though. Digging back up a ways into the JT logs, I see this:
>
> [... killJob() NullPointerException trace, quoted in full above ...]
>
> This exception is then repeated for all the IPC server handlers. So I
> think the problem is that all the handler threads are dying one by one
> due to this NPE. Is this something I can fix myself, or is a patch
> available?
>
> - Aaron
>
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
>> It's possible that the JobTracker is under duress and unable to respond
>> to the TaskTrackers... what do the JobTracker logs say?
>>
>> Arun
>>
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>> [... original report and stack traces, quoted in full above; truncated
>>> here in the archive ...]
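Whether the IBM JVM mattered turned out to be moot (the HADOOP-4552 deadlock discussed above is JVM-independent), but for anyone comparing notes: a daemon reports its runtime through standard system properties, so a trivial check like the following -- plain Java, nothing Hadoop-specific, class name invented -- confirms which JVM a node actually uses:

    // Prints the vendor/VM/runtime of whatever JVM executes it,
    // e.g. "IBM Corporation" / "IBM J9 VM" on the runtime quoted above.
    public class JvmInfo {
        public static void main(String[] args) {
            System.out.println(System.getProperty("java.vendor"));
            System.out.println(System.getProperty("java.vm.name"));
            System.out.println(System.getProperty("java.runtime.version"));
        }
    }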
Re: TaskTrackers disengaging from JobTracker
On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:
> The system load and memory consumption on the JT are both very close to
> idle states -- it's not overworked, I don't think. I may have an idea of
> the problem, though. Digging back up a ways into the JT logs, I see this:
>
> [... killJob() NullPointerException trace, quoted in full above ...]
>
> This exception is then repeated for all the IPC server handlers. So I
> think the problem is that all the handler threads are dying one by one
> due to this NPE.

This should not happen. The IPC handler catches Throwable and handles that. Could you give more details, like the kind of jobs (long/short) you are running, how many tasks they have, etc.?

> Is this something I can fix myself, or is a patch available?
>
> - Aaron
>
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
>> It's possible that the JobTracker is under duress and unable to respond
>> to the TaskTrackers... what do the JobTracker logs say?
>>
>> Arun
>>
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>> [... original report, TaskTracker heartbeat stack trace, and Jetty
>>> thread warnings, quoted in full above ...]
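Devaraj's point, spelled out: the IPC server's handler loop is written so that an exception thrown while processing a call is reported back to the client instead of killing the handler thread. The class below is a self-contained sketch of that pattern; the shape is paraphrased from the 0.18-era ipc.Server, not copied from it, and all names are simplified or invented. It shows why even the killJob() NPE above should cost one call, not a thread -- which is why the dying-handlers theory ultimately pointed somewhere else.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Handler loop that survives per-call failures.
    public class TolerantHandler implements Runnable {
        private final BlockingQueue<Runnable> callQueue =
            new LinkedBlockingQueue<Runnable>();
        private volatile boolean running = true;

        public void submit(Runnable call) { callQueue.add(call); }

        public void run() {
            while (running) {
                try {
                    Runnable call = callQueue.take();  // block for the next RPC
                    try {
                        call.run();                    // may throw, e.g. an NPE
                    } catch (Throwable t) {
                        // the failure is reported to the caller;
                        // the handler thread itself keeps running
                        System.err.println("call failed: " + t);
                    }
                } catch (InterruptedException e) {
                    running = false;                   // shutting down
                }
            }
        }
    }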
Re: TaskTrackers disengaging from JobTracker
It's a cluster being used for a university course; there are 30 students all running code which (to be polite) probably tests the limits of Hadoop's failure recovery logic. :) The current assignment is PageRank over Wikipedia, a 20 GB input corpus. Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 reduce tasks.

I wrote a patch to address the NPE in JobTracker.killJob() and compiled it against TRUNK. I've put this on the cluster and it's now been holding steady for the last hour or so... so that plus whatever other differences there are between 18.1 and TRUNK may have fixed things. (I'll submit the patch to the JIRA as soon as it finishes cranking against the JUnit tests.)

- Aaron

Devaraj Das wrote:
> On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:
>> The system load and memory consumption on the JT are both very close to
>> idle states -- it's not overworked, I don't think. I may have an idea of
>> the problem, though. Digging back up a ways into the JT logs, I see
>> this:
>>
>> [... killJob() NullPointerException trace, quoted in full above ...]
>>
>> This exception is then repeated for all the IPC server handlers. So I
>> think the problem is that all the handler threads are dying one by one
>> due to this NPE.
>
> This should not happen. The IPC handler catches Throwable and handles
> that. Could you give more details, like the kind of jobs (long/short)
> you are running, how many tasks they have, etc.?
>
>> Is this something I can fix myself, or is a patch available?
>>
>> - Aaron
>>
>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
>>> [... Arun's reply, the original report, and the TaskTracker heartbeat
>>> stack trace, quoted in full above; truncated here in the archive ...]
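The actual patch lives on the JIRA, but the shape of such a fix is presumably a null guard: killJob() looks the job id up in the tracker's job table, a stale or already-retired id comes back null, and the subsequent dereference throws. A sketch of that shape -- the signature is approximated from the 0.18-era JobTracker, 'jobs' and LOG stand for the tracker's job map and logger, and none of it is copied from Aaron's patch:

    // Inside JobTracker (sketch only; names approximated).
    public synchronized void killJob(JobID jobid) throws IOException {
        JobInProgress job = jobs.get(jobid);  // null for unknown or retired ids
        if (job == null) {
            LOG.info("killJob(): unknown job " + jobid + ", ignoring");
            return;                           // instead of the NPE at line 1843
        }
        job.kill();
    }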
Re: TaskTrackers disengaging from JobTracker
Just as I wrote that, Murphy's law struck :) This did not fix the issue after all. I think the problem is occurring because a huge amount of network bandwidth is being consumed by the jobs. What settings (timeouts, thread counts, etc.), if any, ought I dial up to correct for this?

Thanks,
- Aaron

Aaron Kimball wrote:
> It's a cluster being used for a university course; there are 30 students
> all running code which (to be polite) probably tests the limits of
> Hadoop's failure recovery logic. :) The current assignment is PageRank
> over Wikipedia, a 20 GB input corpus. Individual jobs run ~5--15 minutes
> in length, using 300 map tasks and 50 reduce tasks.
>
> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
> it against TRUNK. I've put this on the cluster and it's now been holding
> steady for the last hour or so... so that plus whatever other differences
> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
> patch to the JIRA as soon as it finishes cranking against the JUnit
> tests.)
>
> - Aaron
>
> Devaraj Das wrote:
>> [... Devaraj's reply, the killJob() NullPointerException trace, Arun's
>> message, the original report, and the TaskTracker heartbeat stack trace,
>> quoted in full above; truncated here in the archive ...]
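The thread moved on to the RPC deadlock before anyone answered the settings question directly, but the two knobs most directly implicated by the logs in this thread are the JobTracker's IPC handler pool and the TaskTracker's Jetty pool (the "OUT OF THREADS ... :50060" warnings come from the latter; its default of 40 matches the "(40-40+0)" in the log). A hadoop-site.xml sketch -- both property names exist in 0.18, but the values here are illustrative guesses, not tested recommendations:

    <!-- goes inside <configuration> in conf/hadoop-site.xml -->
    <property>
      <name>mapred.job.tracker.handler.count</name>
      <value>20</value>
      <!-- IPC handler threads in the JobTracker; the default is 10 -->
    </property>
    <property>
      <name>tasktracker.http.threads</name>
      <value>80</value>
      <!-- Jetty threads serving map output on port 50060; the default is 40 -->
    </property>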
Re: TaskTrackers disengaging from JobTracker
> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
> it against TRUNK. I've put this on the cluster and it's now been holding
> steady for the last hour or so... so that plus whatever other differences
> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
> patch to the JIRA as soon as it finishes cranking against the JUnit
> tests.)

Aaron, I don't think this is a solution to the problem you are seeing. The IPC handlers are tolerant to exceptions. In particular, they must not die in the event of an exception during RPC processing. Could you please get a stack trace of the JobTracker threads (without your patch) when the TTs are unable to talk to it? Access the URL http://jt-host:jt-info-port/stacks -- that will tell us what the handlers are up to.

> - Aaron
>
> Devaraj Das wrote:
>> On 10/30/08 3:13 AM, Aaron Kimball [EMAIL PROTECTED] wrote:
>>> The system load and memory consumption on the JT are both very close
>>> to idle states -- it's not overworked, I don't think. I may have an
>>> idea of the problem, though. Digging back up a ways into the JT logs,
>>> I see this:
>>>
>>> [... killJob() NullPointerException trace, quoted in full above ...]
>>>
>>> This exception is then repeated for all the IPC server handlers. So I
>>> think the problem is that all the handler threads are dying one by one
>>> due to this NPE.
>>
>> This should not happen. The IPC handler catches Throwable and handles
>> that. Could you give more details, like the kind of jobs (long/short)
>> you are running, how many tasks they have, etc.?
>>
>>> Is this something I can fix myself, or is a patch available?
>>>
>>> - Aaron
>>>
>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy [EMAIL PROTECTED]
>>> wrote:
>>>> [... Arun's reply, the original report, and the TaskTracker heartbeat
>>>> stack trace, quoted in full above; truncated here in the archive ...]
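A closing note on Devaraj's /stacks suggestion, for anyone debugging a similar hang: "jt-info-port" is the JobTracker's web UI port, which defaults to 50030, so unless that has been changed the URL is simply http://<jobtracker-host>:50030/stacks. It dumps a stack trace of every thread in the JobTracker JVM -- and in this thread, those dumps are what exposed the HADOOP-4552 deadlock.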