Hi all,

I'm working with a 30-node Hadoop cluster that has just started
showing some odd behavior. It ran without incident for a few
weeks, and now:

The cluster runs smoothly for 90-120 minutes or so, handling jobs
continually during this time. Then, suddenly, all 29 TaskTrackers get
disconnected from the JobTracker. The TaskTracker daemon processes are
still running on each machine, but the JobTracker says "0 nodes available"
on the web status screen. Restarting MapReduce fixes this for another
90-120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
    at java.lang.Throwable.<init>(Throwable.java:67)
    at org.apache.hadoop.ipc.Client.call(Client.java:718)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
    at sun.nio.ch.IOUtil.read(IOUtil.java:207)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
    at java.io.FilterInputStream.read(FilterInputStream.java:127)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
    at java.io.DataInputStream.readInt(DataInputStream.java:381)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


I also see a few of these messages in the same log:

2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060
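
In case it helps with triage, here is a rough sketch of how the messages
above could be counted per minute in a TaskTracker log, to see whether the
thread warnings line up with the heartbeat failures. It is not something
we run in production. The function name count_events and the event labels
are made up for illustration, and it assumes every log entry starts with a
timestamp in the format shown above, with stack-trace lines belonging to
the most recent timestamped entry.

```python
import re
from collections import Counter

# Substrings taken from the log excerpts above; the labels are hypothetical.
PATTERNS = {
    "heartbeat_failed": re.compile(r"Call failed on local exception"),
    "conn_reset": re.compile(r"Connection reset by peer"),
    "low_on_threads": re.compile(r"LOW ON THREADS"),
    "out_of_threads": re.compile(r"OUT OF THREADS"),
}

# Assumed entry prefix: "2008-10-29 09:49:03,021 " (captured to the minute).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} ")


def count_events(lines):
    """Return a Counter keyed by (minute, label) for each matching line."""
    counts = Counter()
    current_minute = None  # minute of the most recent timestamped entry
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_minute = match.group(1)
        # Stack-trace lines carry no timestamp, so they inherit the last one.
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(current_minute, label)] += 1
    return counts
```

To use it, open a TaskTracker log file and pass the file object to
count_events, then print the sorted items of the result. Comparing the
minutes for out_of_threads against those for heartbeat_failed should show
whether the two are related or just coincidental.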



The NameNode and DataNodes are completely fine. It can't be a DNS issue,
because all name resolution goes through /etc/hosts files. The NameNode and
JobTracker are on the same machine.

Any help is appreciated.
Thanks,
- Aaron Kimball
