Hi Marcos. The issue appears to be the following. A reduce task is unable to fetch results from a map task on HDFS. The map task is re-run, but the second attempt is then unable to find a file that it needs to run. Here is the error from the second map task:
java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
    at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
    at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
    at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!

Kai Ju

P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.

On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlor...@uci.cu> wrote:

> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>
>> Over the past week or two, I've run into an issue where MapReduce jobs
>> hang or fail near completion. The percent completion of both map and
>> reduce tasks is often reported as 100%, but the actual number of
>> completed tasks is less than the total number.
>> It appears that either
>> tasks backtrack and need to be restarted or the last few reduce tasks
>> hang interminably on the copy step.
>>
>> In certain cases, the jobs actually complete. In other cases, I can't
>> wait long enough and have to kill the job manually.
>>
>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>> distribution. Has anyone experienced similar behavior in their clusters,
>> and if so, had any luck resolving it? Thanks!
>>
>> Kai Ju
>
> Can you post here your NN and DN logs files?
> Regards
>
> --
> Marcos Luís Ortíz Valmaseda
> Software Engineer (UCI)
> Linux User # 418229
> http://marcosluis2186.posterous.com
> http://twitter.com/marcosluis2186