Is this reproducible? If so, I'd urge you to check your local disks...

Arun
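
To make the local-disk check concrete, here is a rough sketch in Python, not a definitive diagnostic. It assumes the directories to inspect are the ones listed in mapred.local.dir on each TaskTracker; the path in the example is simply the one from the stack trace below, and the function name check_local_dir and the 1 GiB free-space threshold are made up for illustration. It only confirms that each directory exists, accepts a write and read-back, and has some free space. It says nothing about EBS latency or intermittent I/O errors.

```python
import os
import shutil
import tempfile


def check_local_dir(path, min_free_bytes=1 << 30):
    """Return a list of problems found for one local directory (empty if none).

    min_free_bytes is an arbitrary illustrative threshold (1 GiB by default).
    """
    if not os.path.isdir(path):
        return ["%s: missing or not a directory" % path]

    problems = []

    # Do a real write and read-back; permission bits alone can miss a
    # read-only or otherwise unhealthy mount.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
            probe.seek(0)
            if probe.read() != b"probe":
                problems.append("%s: read-back mismatch" % path)
    except OSError as exc:
        problems.append("%s: write test failed (%s)" % (path, exc))

    # Spill files need room; a nearly full volume is a common culprit.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append("%s: only %d bytes free" % (path, free))

    return problems


if __name__ == "__main__":
    # Assumption: replace this list with the actual mapred.local.dir entries.
    local_dirs = ["/mnt/hadoop/mapred/local"]
    for d in local_dirs:
        for line in check_local_dir(d) or ["%s: ok" % d]:
            print(line)
```

Running this on every node that hosted a failed attempt, as the user the TaskTracker runs as, should at least separate "the directory is gone or unwritable" from "the spill file vanished for some other reason".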
On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:

> Hi Marcos. The issue appears to be the following. A reduce task is unable to
> fetch results from a map task on HDFS. The map task is re-run, but the map
> task is now unable to retrieve information that it needs to run. Here is the
> error from the second map task:
>
> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>     at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>     at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>     at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
> I have been having general difficulties with HDFS on EBS, which pointed me in
> this direction. Does this sound like a possible hypothesis to you? Thanks!
>
> Kai Ju
>
> P.S. I am migrating off of HDFS on EBS, so I will post back with further
> results as soon as I have them.
>
> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>
> > On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
> >
> > > Over the past week or two, I've run into an issue where MapReduce jobs
> > > hang or fail near completion. The percent completion of both map and
> > > reduce tasks is often reported as 100%, but the actual number of
> > > completed tasks is less than the total number. It appears that either
> > > tasks backtrack and need to be restarted or the last few reduce tasks
> > > hang interminably on the copy step.
> > >
> > > In certain cases, the jobs actually complete. In other cases, I can't
> > > wait long enough and have to kill the job manually.
> > >
> > > My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
> > > attached EBS volumes. The instances run Ubuntu 10.04.1 with the
> > > 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
> > > distribution. Has anyone experienced similar behavior in their clusters,
> > > and if so, had any luck resolving it? Thanks!
> >
> > Can you post your NN and DN log files here?
> > Regards
> >
> > > Kai Ju
> >
> > --
> > Marcos Luís Ortíz Valmaseda
> > Software Engineer (UCI)
> > Linux User # 418229
> > http://marcosluis2186.posterous.com
> > http://twitter.com/marcosluis2186