I have an MR job that repeatedly fails during a join operation in the Mapper, with the error "java.io.IOException: Could not obtain block". I'm running on EC2, on a 12-node cluster provisioned by whirr. Oddly enough, on a 5-node cluster the same MR job runs through without any problems.
The repeated exception the tasks report in the web UI for this job is:

java.io.IOException: Could not obtain block: blk_8346145198855916212_1340 file=/user/someuser/output_6_doc_tf_and_u/part-00002
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.sync(SequenceFile.java:2186)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:48)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getRecordReader(DelegatingInputFormat.java:124)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

When I look at the task log details for this failed job, it shows that the DFSClient failed to connect to a datanode that held a replica of this block, and added the datanode's IP address to the list of deadNodes (exception shown below).
11:25:19,204 INFO DFSClient:1835 - Failed to connect to /10.114.123.82:50010, add to deadNodes and continue
java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.202.163.95:43022, remote=/10.114.123.82:50010 for file /user/someuser/output_6_doc_tf_and_u/part-00002 for block 5843350240062345818_1332
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1465)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getRecordReader(DelegatingInputFormat.java:124)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

It then goes on to try the other two datanodes that hold replicas of this block; each throws the same exception and each is added to the list of dead nodes, at which point the task fails.
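To get a feel for how widespread this is, I pull the dead-node additions out of a task log with a grep roughly like the sketch below. It runs here against an inline stand-in sample (the two log lines are illustrative, built from the messages above); on the cluster the real files live under the tasktracker's userlogs directory.

```shell
# Stand-in sample of a task log (the real files live under the
# tasktracker's userlogs directory on each worker node)
cat > /tmp/task-sample.log <<'EOF'
11:25:19,204 INFO DFSClient:1835 - Failed to connect to /10.114.123.82:50010, add to deadNodes and continue
11:25:23,891 INFO DFSClient:1835 - Failed to connect to /10.83.109.118:50010, add to deadNodes and continue
EOF

# Extract the datanodes being marked dead, with a count per node
grep -o 'Failed to connect to /[0-9.]*' /tmp/task-sample.log | sort | uniq -c
```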
This cycle of failures happens multiple times during the job, against several different blocks. I then looked in the namenode's log to see what is going on with the datanodes that are being added to the list of deadNodes, and found them associated with the following error:

2011-07-13 05:33:55,161 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 10.83.109.118:50010

Looking through the rest of the namenode log, I count 36 different entries for lost heartbeats. Is this a common error?

The odd thing is that after the job fails, HDFS seems able to recover itself, bringing these nodes back online and re-replicating the files across the nodes again. So when I browse HDFS and look for one of the files that caused the earlier failures, it's showing up in the correct directory, with its replication set to 3.

Also, I had read that this kind of error can be caused by the default ulimit -n, so I increased it to Cloudera's recommended value of 16384, but I still have the same issue.

Any ideas why I'm seeing such instability in HDFS? Why are these nodes going down and causing my jobs to fail? Any suggestions on what direction I should take to troubleshoot this issue?

--
Thanks,
John C
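For reference, this is roughly how I arrived at the count of 36, plus a quick sanity check on the raised file-descriptor limit. The sketch runs against an inline stand-in sample (the second log line is illustrative); the real namenode log lives under the Hadoop log directory on the namenode, and the /proc check assumes Linux.

```shell
# Stand-in sample of the namenode log (the real file lives under the
# Hadoop log directory on the namenode)
cat > /tmp/namenode-sample.log <<'EOF'
2011-07-13 05:33:55,161 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 10.83.109.118:50010
2011-07-13 05:34:02,417 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 10.114.123.82:50010
EOF

# Total lost-heartbeat events, and which datanodes they came from
grep -c 'lost heartbeat' /tmp/namenode-sample.log
grep -o 'lost heartbeat from [0-9.:]*' /tmp/namenode-sample.log | sort | uniq -c

# Sanity check on the fd limit: a running DataNode JVM keeps the limit it
# started with, so it needs a restart after raising ulimit -- on the
# datanode itself, /proc/<datanode-pid>/limits shows the live value
grep 'Max open files' /proc/self/limits
```

One thing worth double-checking here is that the raised limit is applied for the user the datanode runs as, not just the login shell where ulimit was changed.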