It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say?

Arun

On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

Hi all,

I'm working with a 30-node Hadoop cluster that has just started
demonstrating some weird behavior. It ran without incident for a few
weeks, and now:

The cluster runs smoothly for 90-120 minutes or so, handling jobs
continually during this time. Then, suddenly, all 29 TaskTrackers get
disconnected from the JobTracker. All the tracker daemon processes are
still running on each machine, but the JobTracker says "0 nodes
available" on the web status screen. Restarting MapReduce fixes this for
another 90-120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.<init>(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060



The NameNode and DataNodes are completely fine. Can't be a DNS issue,
because all DNS is served through /etc/hosts files. NameNode and JobTracker
are on the same machine.

Any help is appreciated
Thanks
- Aaron Kimball
