Hi,
  We have a 200-node Hadoop cluster (0.20.0) and have raised the namenode and
datanode handler counts to 40 and 10, respectively. The xcievers limit also had
to be raised to 8192. During mapred jobs, however, we are seeing a lot of task
attempt failures reporting "connection reset by peer". The exceptions below
appear in the namenode logs. The number of TCP connection failures on the
namenode also seems high. What could be wrong? (The namenode has >6G of heap.)
Do we need to increase the handler counts further?
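
For reference, the settings described above would typically correspond to the
following properties in hdfs-site.xml (or hadoop-site.xml). This is only a
sketch: the property names are assumed from the stock 0.20 configuration and
are not copied from the cluster's actual files.

```xml
<!-- Assumed property names from stock Hadoop 0.20; values as stated above -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>40</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>
<property>
  <!-- "xcievers" is the spelling Hadoop itself uses for this property -->
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>
```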

   Another side effect of this issue is that the SequenceFile output of the job
seems to be corrupted. The next job is not able to read some of the sequence
files created by the previous job (even though the first job eventually
succeeds after a lot of connection-related failures).

namenode exceptions:

java.io.IOException: Connection reset by peer
2009-10-01 00:44:32,329 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,330 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,331 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:32,343 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:34,575 INFO org.apache.hadoop.ipc.Server: IPC Server handler 21 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:34,601 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
2009-10-01 00:44:34,943 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
2009-10-01 00:44:35,641 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 9000 caught: java.nio.channels.ClosedChannelException
2009-10-01 00:44:40,380 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0


Exception when the next job reads the sequence file:

java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)

Thanks,
Murali Krishna
