Re: Merging of the local FS files threw an exception

2008-10-02 Thread Per Jacobsson
Quick FYI: I've run the same job twice more without seeing the error.
/ Per

On Wed, Oct 1, 2008 at 11:07 AM, Per Jacobsson [EMAIL PROTECTED] wrote:

 Hi everyone,
 (apologies if this gets posted on the list twice for some reason, my first
 attempt was denied as suspected spam)

 I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small
 AMI. The job was producing gzipped output; otherwise I haven't changed the
 configuration.

 The final reduce steps failed with this error, which I haven't seen before:

 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96
     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:331)
     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2021)
     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2025)

 2008-10-01 05:02:44,131 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
 java.io.IOException: attempt_200809301822_0005_r_01_0The reduce copier failed
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

 When I try to download the data from HDFS I get a "Found checksum error"
 warning message.

 Any ideas what could be the cause? Would upgrading to 0.18.1 solve it?
 Thanks,
 / Per




Re: Merging of the local FS files threw an exception

2008-10-01 Thread Arun C Murthy

On Oct 1, 2008, at 11:07 AM, Per Jacobsson wrote:
I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small
AMI. The job was producing gzipped output; otherwise I haven't changed the
configuration.

The final reduce steps failed with this error, which I haven't seen before:

2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96


Do you still have the task logs for the reduce?

I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647
which we could never reproduce reliably enough to pin down or fix.


However, in light of http://issues.apache.org/jira/browse/HADOOP-4277 we
suspect this could be caused by a bug in the LocalFileSystem that could
hide data corruption on your local disk, leading to errors of this nature.
Could you try running your job with that patch once release 0.18.2 is
available?


Any information you can provide would greatly help us confirm the above
hypothesis, so it's much appreciated!


Arun
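
To illustrate the hypothesis above: in a length-prefixed record file, a single corrupted byte in a length field can surface as a negative length when the record is read back, and a checksum over the raw bytes is what would normally catch the damage first. The sketch below is a minimal, self-contained Java illustration under stated assumptions. The class `LengthPrefixedRecords`, its fixed 4-byte length fields, and the CRC32 check are all hypothetical; this is not Hadoop's IFile format or its LocalFileSystem code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not Hadoop's IFile or LocalFileSystem code.
final class LengthPrefixedRecords {

    // Layout (assumed for this sketch): keyLength (int), valueLength (int), key bytes, value bytes.
    static byte[] writeRecord(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + k.length + v.length);
        buf.putInt(k.length).putInt(v.length).put(k).put(v);
        return buf.array();
    }

    // Reads one record and rejects nonsensical lengths before allocating buffers.
    static String[] readRecord(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int keyLength = buf.getInt();
        int valueLength = buf.getInt();
        if (keyLength < 0) {
            throw new IllegalStateException("Negative key-length: " + keyLength);
        }
        if (valueLength < 0) {
            throw new IllegalStateException("Negative value-length: " + valueLength);
        }
        byte[] k = new byte[keyLength];
        byte[] v = new byte[valueLength];
        buf.get(k);
        buf.get(v);
        return new String[] {
            new String(k, StandardCharsets.UTF_8),
            new String(v, StandardCharsets.UTF_8)
        };
    }

    // CRC32 over the raw record bytes, standing in for a file-level checksum.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] record = writeRecord("k1", "v1");
        long expected = checksum(record);

        // Simulate on-disk corruption: overwrite the high byte of the value-length field.
        byte[] corrupted = record.clone();
        corrupted[4] = (byte) 0xFF;

        System.out.println("checksum matches: " + (checksum(corrupted) == expected));
        try {
            readRecord(corrupted);
        } catch (IllegalStateException e) {
            System.out.println("reader failed: " + e.getMessage());
        }
    }
}
```

If checksum verification is skipped or broken, the record reader is the first place the corruption becomes visible. That is consistent with the failure mode suspected in this thread, but it does not prove it.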



Re: Merging of the local FS files threw an exception

2008-10-01 Thread Per Jacobsson
I've collected the syslogs from the failed reduce jobs.  What's the best way
to get them to you? Let me know if you need anything else, I'll have to shut
down these instances some time later today.

Overall I've run this same job before with no problems. The only change is
the added gzip of the output. Don't know if it's worth anything, but the
four failures all happened on different machines. I'll be running this job
plenty of times so if the problem keeps happening it will be obvious.
/ Per





Re: Merging of the local FS files threw an exception

2008-10-01 Thread Arun C Murthy


On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:

I've collected the syslogs from the failed reduce jobs.  What's the best way
to get them to you? Let me know if you need anything else, I'll have to shut
down these instances some time later today.



Could you please attach them to the JIRA issue
(http://issues.apache.org/jira/browse/HADOOP-3647)? Thanks!


Arun

Overall I've run this same job before with no problems. The only change is
the added gzip of the output. Don't know if it's worth anything, but the
four failures all happened on different machines. I'll be running this job
plenty of times so if the problem keeps happening it will be obvious.
/ Per



In 0.18 we rewrote the whole path from the map output through the shuffle
and the merge on the reducer, so this could be a bug in that code. Again, we hope
http://issues.apache.org/jira/browse/HADOOP-4277 will fix this.


Arun
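
For readers unfamiliar with the reduce-side merge mentioned above, the stack trace in this thread (`Merger$Segment.next`, `Merger$MergeQueue.adjustPriorityQueue`) suggests a priority-queue merge of sorted segments. The sketch below shows that general technique in plain Java. It is a hypothetical illustration: `SortedSegmentMerge` and its nested `Segment` class are invented names, and the code is not taken from Hadoop's Merger.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not Hadoop's Merger implementation.
final class SortedSegmentMerge {

    // One sorted input segment plus the record currently at its head.
    private static final class Segment {
        final Iterator<String> records;
        String head;

        Segment(Iterator<String> records) {
            this.records = records;
        }

        // Moves to the next record; returns false once the segment is exhausted.
        boolean advance() {
            if (!records.hasNext()) {
                return false;
            }
            head = records.next();
            return true;
        }
    }

    // Merges already-sorted segments into one sorted list.
    static List<String> merge(List<List<String>> sortedSegments) {
        PriorityQueue<Segment> queue =
            new PriorityQueue<>(Comparator.comparing((Segment s) -> s.head));
        for (List<String> records : sortedSegments) {
            Segment segment = new Segment(records.iterator());
            if (segment.advance()) {
                queue.add(segment);
            }
        }

        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            // Take the segment with the smallest head, emit it, then re-queue the segment.
            Segment smallest = queue.poll();
            merged.add(smallest.head);
            if (smallest.advance()) {
                queue.add(smallest);
            }
        }
        return merged;
    }
}
```

In a merge of this shape, every output record requires advancing one input segment, so a single unreadable record in any one segment stops the whole merge.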







Re: Merging of the local FS files threw an exception

2008-10-01 Thread Per Jacobsson
Attached to the ticket. Hope this helps.
/ Per
