Hi all,

I'm working with a 30-node Hadoop cluster that has just started demonstrating some weird behavior. It ran without incident for a few weeks, and now:
The cluster will run smoothly for 90-120 minutes or so, handling jobs continually during this time. Then, suddenly, all 29 TaskTrackers get disconnected from the JobTracker. The tracker daemon processes are still running on each machine, but the JobTracker reports "0 nodes available" on the web status screen. Restarting MapReduce fixes this for another 90-120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
        at java.lang.Throwable.<init>(Throwable.java:67)
        at org.apache.hadoop.ipc.Client.call(Client.java:718)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
        at sun.nio.ch.IOUtil.read(IOUtil.java:207)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.FilterInputStream.read(FilterInputStream.java:127)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
        at java.io.DataInputStream.readInt(DataInputStream.java:381)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

As well as a few of these warnings:

2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060

The NameNode and DataNodes are completely fine. It can't be a DNS issue, because all hostname resolution goes through /etc/hosts files. The NameNode and JobTracker are on the same machine.

Any help is appreciated.

Thanks
- Aaron Kimball