Hi,
We have a 200-node Hadoop cluster (0.20.0) and have set the namenode and
datanode handler counts to 40 and 10, respectively. The xcievers limit also
had to be raised to 8192. But during mapred jobs, we are seeing a lot of task
attempt failures saying "connection reset by peer". The following exceptions
appear in the namenode logs. The number of TCP connection failures on the
namenode also seems to be high. What could be wrong? (The namenode has more
than 6G of heap.) Do we need to increase the handlers further?
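For reference, the handler and xciever settings described above would
correspond to entries along these lines in hdfs-site.xml. This is a sketch
only: the property names are the stock 0.20 names and are assumed here, not
copied from the cluster's actual configuration files.

```xml
<!-- Sketch of the settings described above; property names are assumed
     to be the standard Hadoop 0.20 ones, values are those quoted in the text. -->
<configuration>
  <property>
    <!-- Number of namenode RPC handler threads -->
    <name>dfs.namenode.handler.count</name>
    <value>40</value>
  </property>
  <property>
    <!-- Number of datanode RPC handler threads -->
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <!-- Upper bound on concurrent data transfer threads per datanode
         (the property name really is spelled "xcievers") -->
    <name>dfs.datanode.max.xcievers</name>
    <value>8192</value>
  </property>
</configuration>
```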
Another side effect of this issue is that the SequenceFile output of the job
seems to be corrupted. The next job is not able to read some of the
sequence files created by the previous job (even though the first job
eventually succeeds after a lot of connection-related failures).
namenode exceptions:
java.io.IOException: Connection reset by peer
2009-10-01 00:44:32,329 INFO org.apache.hadoop.ipc.Server: IPC Server handler
24 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,330 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5
on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,331 INFO org.apache.hadoop.ipc.Server: IPC Server handler
29 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,343 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2
on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:34,575 INFO org.apache.hadoop.ipc.Server: IPC Server handler
21 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:34,601 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
2009-10-01 00:44:34,943 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
2009-10-01 00:44:35,641 INFO org.apache.hadoop.ipc.Server: IPC Server handler
16 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:40,380 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
Exception from the next job when reading the sequence file:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
Thanks,
Murali Krishna