It's a cluster being used for a university course; there are 30 students all running code which (to be polite) probably tests the limits of Hadoop's failure recovery logic. :)

The current assignment is PageRank over Wikipedia, a 20 GB input corpus. Individual jobs run about 5--15 minutes each, using 300 map tasks and 50 reduce tasks.
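For reference, in the 0.18 API those task counts come straight from the job setup; a rough sketch of how a job of that shape is sized (this is not the actual assignment code, and PageRankIteration is just a placeholder class name):

  // Sketch only: sizing a job like the PageRank iterations described above.
  // PageRankIteration is a placeholder; the students' real driver class differs.
  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf(PageRankIteration.class);
  conf.setJobName("pagerank-iteration");
  conf.setNumMapTasks(300);    // only a hint; the real map count follows the input splits
  conf.setNumReduceTasks(50);  // this one is exact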

I wrote a patch to address the NPE in JobTracker.killJob() and compiled it against TRUNK. I've put this on the cluster and it has been holding steady for the last hour or so, so that plus whatever other differences there are between 0.18.1 and TRUNK may have fixed things. (I'll submit the patch to the JIRA as soon as it finishes cranking through the JUnit tests.)
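The change is basically just a guard before dereferencing the looked-up job; a rough sketch of its shape (names follow the 0.18-era JobTracker, and the details may not match the final patch exactly):

  // Sketch of the shape of the fix, not the exact patch. The NPE at
  // JobTracker.java:1843 is consistent with killJob() being asked about a
  // job that has already been retired, so jobs.get() returns null.
  public synchronized void killJob(JobID jobid) throws IOException {
      JobInProgress job = jobs.get(jobid);
      if (job == null) {
          LOG.info("killJob(): job " + jobid + " is not known to the JobTracker");
          return;   // previously this fell through and dereferenced null
      }
      job.kill();
  }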

- Aaron


Devaraj Das wrote:

On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

The system load and memory consumption on the JT are both very close to "idle" levels; I don't think it's overworked.

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:599)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.


This should not happen; the IPC handler catches Throwable and handles it. Could you give more details, such as the kind of jobs (long/short) you are running, how many tasks they have, etc.?
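To illustrate the pattern: each handler thread loops over the call queue and catches Throwable per call, so one failing call is reported back to the client rather than killing the thread. A simplified, generic sketch (not the actual org.apache.hadoop.ipc.Server code; the queue and class here are just placeholders):

  // Generic illustration of the "catch Throwable per call" handler pattern.
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  public class HandlerLoopSketch {
      private final BlockingQueue<Runnable> callQueue = new LinkedBlockingQueue<Runnable>();
      private volatile boolean running = true;

      public void runHandler() {
          while (running) {
              try {
                  Runnable call = callQueue.take();  // next queued RPC
                  call.run();                        // may throw, e.g. an NPE
              } catch (Throwable t) {
                  // the error is logged / returned to the caller;
                  // the handler thread itself keeps running
                  System.err.println("handler caught: " + t);
              }
          }
      }
  }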

Is this something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

It's possible that the JobTracker is under duress and unable to respond to
the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

Hi all,
I'm working with a 30-node Hadoop cluster that has just started demonstrating some weird behavior. It had run without incident for a few weeks... and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine, but the JobTracker will say "0 nodes available" on the web status screen. Restarting MapReduce fixes this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
        at java.lang.Throwable.<init>(Throwable.java:67)
        at org.apache.hadoop.ipc.Client.call(Client.java:718)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
        at sun.nio.ch.IOUtil.read(IOUtil.java:207)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.FilterInputStream.read(FilterInputStream.java:127)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
        at java.io.DataInputStream.readInt(DataInputStream.java:381)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060



The NameNode and DataNodes are completely fine. It can't be a DNS issue, because all hostname resolution goes through /etc/hosts files. The NameNode and JobTracker are on the same machine.

Any help is appreciated.
Thanks,
- Aaron Kimball



