Re: TaskTrackers disengaging from JobTracker

Arun C Murthy Wed, 29 Oct 2008 12:57:18 -0700

It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say?


Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

Hi all,

I'm working with a 30-node Hadoop cluster that has just started
showing some weird behavior. It ran without incident for a few
weeks, and now:

The cluster will run smoothly for 90-120 minutes or so, handling jobs
continually during this time. Then suddenly all 29 TaskTrackers get
disconnected from the JobTracker. All the tracker daemon processes are still
running on each machine, but the JobTracker says "0 nodes available" on the
web status screen. Restarting MapReduce fixes this for another 90-120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:
2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.<init>(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060
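
To line up when these messages occur against the disconnects, something like the following rough Python sketch could tally WARN/ERROR entries per hour and per logger. It assumes the log4j line layout visible in the excerpts here (timestamp, level, logger, message). The function name `summarize` and the sample list are hypothetical, made up for illustration, and are not part of Hadoop.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpts in this thread:
# "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger.name: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"(?P<level>[A-Z]+) (?P<logger>[\w.$]+): (?P<msg>.*)$"
)


def summarize(lines, levels=("WARN", "ERROR", "FATAL")):
    """Count log entries per (hour, level, logger) for the given levels."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # stack-trace frames and other continuation lines
        if m.group("level") not in levels:
            continue
        hour = m.group("ts")[:13]  # keep "YYYY-MM-DD HH"
        counts[(hour, m.group("level"), m.group("logger"))] += 1
    return counts


# Hypothetical sample built from the lines quoted in this message.
sample = [
    "2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: "
    "LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060",
    "2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: "
    "OUT OF THREADS: [EMAIL PROTECTED]:50060",
    "2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: "
    "Caught exception: java.io.IOException: Call failed on local exception",
    "   at org.apache.hadoop.ipc.Client.call(Client.java:718)",
]

for key, n in sorted(summarize(sample).items()):
    print(key, n)
```

Running the same function over the JobTracker log and each TaskTracker log (for example, `summarize(open(path))`) would show whether the thread warnings and the heartbeat failures cluster in the same hours.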



The NameNode and DataNodes are completely fine. It can't be a DNS issue,
because all DNS is served through /etc/hosts files. The NameNode and JobTracker
are on the same machine.

Any help is appreciated.
Thanks,
- Aaron Kimball
