Re: Merging of the local FS files threw an exception
Quick FYI: I've run the same job twice more without seeing the error.
/ Per

On Wed, Oct 1, 2008 at 11:07 AM, Per Jacobsson <[EMAIL PROTECTED]> wrote:
> Hi everyone,
> (apologies if this gets posted to the list twice for some reason; my first attempt was denied as "suspected spam")
>
> I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small AMI. The job was producing gzipped output; otherwise I haven't changed the configuration.
>
> The final reduce steps failed with this error that I haven't seen before:
>
> 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96
>         at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:331)
>         at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
>         at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2021)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2025)
>
> 2008-10-01 05:02:44,131 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: attempt_200809301822_0005_r_01_0 The reduce copier failed
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>
> When I try to download the data from HDFS I get a "Found checksum error" warning message.
>
> Any ideas what could be the cause? Would upgrading to 0.18.1 solve it?
> Thanks,
> / Per
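For readers unfamiliar with the on-disk format the merger is reading: IFile stores each record as a key-length and value-length prefix followed by the raw key and value bytes, so a corrupted length field is caught the moment it reads back negative, which is exactly the "Negative value-length" check in the stack trace above. The sketch below is a simplified, stdlib-only illustration of that idea, not the actual IFile code (the real reader uses variable-length ints, checksums, and throws IOException); the class and method names are invented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified sketch of a length-prefixed record layout like IFile's.
// A flipped bit in a length prefix can turn it negative; the reader
// rejects that instead of trusting it and reading garbage.
public class RecordLengthCheck {

    // Serialize one record as [keyLen][valLen][key bytes][value bytes].
    static byte[] writeRecord(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(key.length);
            out.writeInt(value.length);
            out.write(key);
            out.write(value);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory streams
        }
    }

    // Read one record back, refusing negative lengths (corruption).
    static byte[][] readRecord(byte[] buf) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
            int keyLen = in.readInt();
            int valLen = in.readInt();
            if (keyLen < 0 || valLen < 0) {
                // The real IFile reader throws IOException here.
                throw new IllegalStateException("Negative value-length: " + valLen);
            }
            byte[] key = new byte[keyLen];
            byte[] val = new byte[valLen];
            in.readFully(key);
            in.readFully(val);
            return new byte[][] { key, val };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Flipping the sign bit of the value-length prefix (byte 4 in this layout) makes `readRecord` fail with the same style of message the merger logged.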
Re: Merging of the local FS files threw an exception
Attached to the ticket. Hope this helps.
/ Per

On Wed, Oct 1, 2008 at 1:33 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:
>> I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.
>
> Could you please attach them to the jira: http://issues.apache.org/jira/browse/HADOOP-3647? Thanks!
>
> Arun
Re: Merging of the local FS files threw an exception
On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:
> I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.

Could you please attach them to the jira: http://issues.apache.org/jira/browse/HADOOP-3647? Thanks!

> Overall I've run this same job before with no problems. The only change is the added gzip of the output. Don't know if it's worth anything, but the four failures all happened on different machines. I'll be running this job plenty of times, so if the problem keeps happening it will be obvious.
> / Per

With 0.18 we rewrote the path from the output of the map through the shuffle and the merge on the reducer. So that could be a bug; again, we hope http://issues.apache.org/jira/browse/HADOOP-4277 will fix this.

Arun
Re: Merging of the local FS files threw an exception
I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.

Overall I've run this same job before with no problems. The only change is the added gzip of the output. Don't know if it's worth anything, but the four failures all happened on different machines. I'll be running this job plenty of times, so if the problem keeps happening it will be obvious.
/ Per

On Wed, Oct 1, 2008 at 11:23 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> Do you still have the task logs for the reduce?
>
> I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647, which we never could reproduce reliably enough to pin down and fix.
>
> However, in light of http://issues.apache.org/jira/browse/HADOOP-4277, we suspect this could be caused by a bug in the LocalFileSystem that can hide data corruption on your local disk, leading to errors of this nature. Could you try running your job with that patch once release 0.18.2 is available?
>
> Any information you provide would greatly help confirm the above hypothesis, so it's much appreciated!
>
> Arun
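For reference, "the added gzip of the output" on Hadoop 0.18 amounts to a job configuration along these lines (old `org.apache.hadoop.mapred` API). The two output-compression settings are the standard ones; the rest of the job setup is elided and the job name is a hypothetical placeholder.

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Stock configuration plus gzipped output, as described above.
JobConf conf = new JobConf();
conf.setJobName("nightly-job"); // hypothetical name

// Equivalent to setting mapred.output.compress=true and
// mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
```

Because gzip is the only change from a previously working job, the compressed map outputs exercise the rewritten shuffle/merge path differently, which fits Arun's suspicion below.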
Re: Merging of the local FS files threw an exception
On Oct 1, 2008, at 11:07 AM, Per Jacobsson wrote:
> I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small AMI. The job was producing gzipped output; otherwise I haven't changed the configuration.
>
> The final reduce steps failed with this error that I haven't seen before:
>
> 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96

Do you still have the task logs for the reduce?

I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647, which we never could reproduce reliably enough to pin down and fix.

However, in light of http://issues.apache.org/jira/browse/HADOOP-4277, we suspect this could be caused by a bug in the LocalFileSystem that can hide data corruption on your local disk, leading to errors of this nature. Could you try running your job with that patch once release 0.18.2 is available?

Any information you provide would greatly help confirm the above hypothesis, so it's much appreciated!

Arun
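Background on why a LocalFileSystem bug can "hide" corruption: Hadoop's local filesystem is checksummed, computing a CRC over each fixed-size chunk of a file (the `io.bytes.per.checksum` setting, 512 bytes by default) and storing the CRCs in hidden `.crc` side files for verification on read. The sketch below is a conceptual model of that scheme using `java.util.zip.CRC32`, not Hadoop's actual implementation; the class and method names are invented. If verification is skipped or buggy, corrupt bytes flow straight into the merge and only surface later, e.g. as the "Negative value-length" error above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Conceptual model of per-chunk checksumming as done by Hadoop's
// ChecksumFileSystem: one CRC per fixed-size chunk, verified on read.
public class ChunkChecksum {
    static final int CHUNK = 512; // bytes per checksum, as in io.bytes.per.checksum

    // Compute one CRC32 per CHUNK-sized slice of the data.
    static List<Long> checksums(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // Verify data against stored checksums; return the index of the
    // first corrupt chunk, or -1 if everything matches.
    static int verify(byte[] data, List<Long> expected) {
        List<Long> actual = checksums(data);
        for (int i = 0; i < expected.size(); i++) {
            if (!actual.get(i).equals(expected.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

A single flipped bit anywhere in a chunk changes that chunk's CRC, so a working verifier pinpoints the corrupt region immediately; a verifier that never runs lets the bad bytes through silently, which is the failure mode suspected in HADOOP-4277.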