Aaron, does this have to do with the cloud we got at SoftLayer, or is it something at the university like you said?
I think I provisioned 6 - 10 with 1G connections; let me log in and check really quick.. no, they're all correct: Speed: 1000Mb/s

On Wed, Oct 29, 2008 at 9:49 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:

> Just as I wrote that, Murphy's law struck :) This did not fix the issue
> after all.
>
> I think the problem is occurring because a huge amount of network bandwidth
> is being consumed by the jobs. What settings (timeouts, thread counts, etc.),
> if any, ought I dial up to correct for this?
>
> Thanks,
> - Aaron
>
>
> Aaron Kimball wrote:
>
>> It's a cluster being used for a university course; there are 30 students
>> all running code which (to be polite) probably tests the limits of Hadoop's
>> failure recovery logic. :)
>>
>> The current assignment is PageRank over Wikipedia, a 20 GB input corpus.
>> Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50
>> reduce tasks.
>>
>> I wrote a patch to address the NPE in JobTracker.killJob() and compiled it
>> against TRUNK. I've put this on the cluster and it's now been holding steady
>> for the last hour or so.. so that, plus whatever other differences there are
>> between 18.1 and TRUNK, may have fixed things. (I'll submit the patch to the
>> JIRA as soon as it finishes cranking against the JUnit tests.)
>>
>> - Aaron
>>
>>
>> Devaraj Das wrote:
>>
>>> On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:
>>>
>>>> The system load and memory consumption on the JT are both very close to
>>>> "idle" states -- it's not overworked, I don't think.
>>>>
>>>> I may have an idea of the problem, though.
>>>> Digging back up a ways into the JT logs, I see this:
>>>>
>>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>> handler 4 on 9001, call killJob(job_200810290855_0025) from
>>>> 10.1.143.245:48253: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>>>   at java.lang.reflect.Method.invoke(Method.java:599)
>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>>>
>>>> This exception is then repeated for all the IPC server handlers. So I
>>>> think the problem is that all the handler threads are dying one by one
>>>> due to this NPE.
>>>
>>> This should not happen. The IPC handler catches Throwable and handles that.
>>> Could you give more details, like the kind of jobs (long/short) you are
>>> running, how many tasks they have, etc.?
>>>
>>>> Is this something I can fix myself, or is a patch available?
>>>>
>>>> - Aaron
>>>>
>>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> It's possible that the JobTracker is under duress and unable to respond
>>>>> to the TaskTrackers... what do the JobTracker logs say?
>>>>>
>>>>> Arun
>>>>>
>>>>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm working with a 30 node Hadoop cluster that has just started
>>>>>> demonstrating some weird behavior. It's run without incident for a few
>>>>>> weeks..
>>>>>> and now:
>>>>>>
>>>>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>>>>> continually during this time. Then suddenly all 29 TaskTrackers will get
>>>>>> disconnected from the JobTracker. All the tracker daemon processes are
>>>>>> still running on each machine, but the JobTracker will say "0 nodes
>>>>>> available" on the web status screen. Restarting MapReduce fixes this for
>>>>>> another 90--120 minutes.
>>>>>>
>>>>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>>>>> but we're running on 0.18.1.
>>>>>>
>>>>>> I found this in a TaskTracker log:
>>>>>>
>>>>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
>>>>>> Caught exception: java.io.IOException: Call failed on local exception
>>>>>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>>>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>>>>   at
>>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>>>>
>>>>>> As well as a few of these warnings:
>>>>>>
>>>>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>>>>> THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>>>>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>>>>> THREADS: [EMAIL PROTECTED]:50060
>>>>>>
>>>>>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>>>>>> because all DNS is served through /etc/hosts files. The NameNode and
>>>>>> JobTracker are on the same machine.
>>>>>>
>>>>>> Any help is appreciated.
>>>>>> Thanks,
>>>>>> - Aaron Kimball
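
On the "what settings ought I dial up" question in the quoted thread, here is a hedged sketch of hadoop-site.xml overrides that map to the symptoms, not a tested fix. Property names are from the 0.18-era hadoop-default.xml as best I recall, and the values are illustrative starting points only:

```xml
<!-- Hedged sketch only. Verify names against your hadoop-default.xml. -->
<configuration>
  <!-- The Jetty "LOW ON THREADS ((40-40+0)<1)" warning matches the default
       of 40 HTTP threads per TaskTracker serving map output to reducers;
       raising it may relieve the shuffle-heavy PageRank jobs. -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
  <!-- More RPC handler threads on the JobTracker (port 9001), so that
       heartbeats are less likely to starve while handlers are busy. -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>20</value>
  </property>
</configuration>
```

Either way, if the real cause is the killJob() NPE taking handlers down one by one, more threads would only delay the lockup rather than cure it; the patch against TRUNK is the actual fix.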