Re: Merging of the local FS files threw an exception
Quick FYI: I've run the same job twice more without seeing the error.
/ Per

On Wed, Oct 1, 2008 at 11:07 AM, Per Jacobsson <[EMAIL PROTECTED]> wrote:
> Hi everyone,
> (apologies if this gets posted to the list twice for some reason; my first attempt was denied as "suspected spam")
>
> I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small AMI. The job was producing gzipped output; otherwise I haven't changed the configuration.
>
> The final reduce steps failed with this error that I haven't seen before:
>
> 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96
>         at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:331)
>         at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
>         at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2021)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2025)
>
> 2008-10-01 05:02:44,131 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: attempt_200809301822_0005_r_01_0 The reduce copier failed
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>
> When I try to download the data from HDFS I get a "Found checksum error" warning message.
>
> Any ideas what could be the cause? Would upgrading to 0.18.1 solve it?
> Thanks,
> / Per
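For readers unfamiliar with the on-disk format the merger is reading: IFile stores each record as a key-length and value-length prefix followed by the raw key and value bytes, so a corrupted length field is caught the moment it reads back negative, which is exactly the "Negative value-length" check in the stack trace above. The sketch below is a simplified, stdlib-only illustration of that idea, not the actual IFile code (the real reader uses variable-length ints, checksums, and throws IOException); the class and method names are invented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified sketch of a length-prefixed record layout like IFile's.
// A flipped bit in a length prefix can turn it negative; the reader
// rejects that instead of trusting it and reading garbage.
public class RecordLengthCheck {

    // Serialize one record as [keyLen][valLen][key bytes][value bytes].
    static byte[] writeRecord(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(key.length);
            out.writeInt(value.length);
            out.write(key);
            out.write(value);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory streams
        }
    }

    // Read one record back, refusing negative lengths (corruption).
    static byte[][] readRecord(byte[] buf) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
            int keyLen = in.readInt();
            int valLen = in.readInt();
            if (keyLen < 0 || valLen < 0) {
                // The real IFile reader throws IOException here.
                throw new IllegalStateException("Negative value-length: " + valLen);
            }
            byte[] key = new byte[keyLen];
            byte[] val = new byte[valLen];
            in.readFully(key);
            in.readFully(val);
            return new byte[][] { key, val };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Flipping the sign bit of the value-length prefix (byte 4 in this layout) makes `readRecord` fail with the same style of message the merger logged.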
Re: Merging of the local FS files threw an exception
Attached to the ticket. Hope this helps.
/ Per

On Wed, Oct 1, 2008 at 1:33 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:
>> I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.
>
> Could you please attach them to the jira: http://issues.apache.org/jira/browse/HADOOP-3647? Thanks!
>
> Arun
Re: Merging of the local FS files threw an exception
On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:
> I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.

Could you please attach them to the jira: http://issues.apache.org/jira/browse/HADOOP-3647? Thanks!

> Overall I've run this same job before with no problems. The only change is the added gzip of the output. Don't know if it's worth anything, but the four failures all happened on different machines. I'll be running this job plenty of times, so if the problem keeps happening it will be obvious.
> / Per

With 0.18 we rewrote the path from the output of the map through the shuffle and the merge on the reducer. So that could be a bug; again, we hope http://issues.apache.org/jira/browse/HADOOP-4277 will fix this.

Arun
Re: Merging of the local FS files threw an exception
I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.

Overall I've run this same job before with no problems. The only change is the added gzip of the output. Don't know if it's worth anything, but the four failures all happened on different machines. I'll be running this job plenty of times, so if the problem keeps happening it will be obvious.
/ Per

On Wed, Oct 1, 2008 at 11:23 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> Do you still have the task logs for the reduce?
>
> I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647, which we never could reproduce reliably enough to pin down and fix.
>
> However, in light of http://issues.apache.org/jira/browse/HADOOP-4277, we suspect this could be caused by a bug in the LocalFileSystem that can hide data corruption on your local disk, leading to errors of this nature. Could you try running your job with that patch once release 0.18.2 is available?
>
> Any information you provide would greatly help confirm the above hypothesis, so it's much appreciated!
>
> Arun
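For reference, "the added gzip of the output" on Hadoop 0.18 amounts to a job configuration along these lines (old `org.apache.hadoop.mapred` API). The two output-compression settings are the standard ones; the rest of the job setup is elided and the job name is a hypothetical placeholder.

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Stock configuration plus gzipped output, as described above.
JobConf conf = new JobConf();
conf.setJobName("nightly-job"); // hypothetical name

// Equivalent to setting mapred.output.compress=true and
// mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
```

Because gzip is the only change from a previously working job, the compressed map outputs exercise the rewritten shuffle/merge path differently, which fits Arun's suspicion below.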
Re: Merging of the local FS files threw an exception
On Oct 1, 2008, at 11:07 AM, Per Jacobsson wrote:
> I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small AMI. The job was producing gzipped output; otherwise I haven't changed the configuration.
>
> The final reduce steps failed with this error that I haven't seen before:
>
> 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96

Do you still have the task logs for the reduce?

I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647, which we never could reproduce reliably enough to pin down and fix.

However, in light of http://issues.apache.org/jira/browse/HADOOP-4277, we suspect this could be caused by a bug in the LocalFileSystem that can hide data corruption on your local disk, leading to errors of this nature. Could you try running your job with that patch once release 0.18.2 is available?

Any information you provide would greatly help confirm the above hypothesis, so it's much appreciated!

Arun
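Background on why a LocalFileSystem bug can "hide" corruption: Hadoop's local filesystem is checksummed, computing a CRC over each fixed-size chunk of a file (the `io.bytes.per.checksum` setting, 512 bytes by default) and storing the CRCs in hidden `.crc` side files for verification on read. The sketch below is a conceptual model of that scheme using `java.util.zip.CRC32`, not Hadoop's actual implementation; the class and method names are invented. If verification is skipped or buggy, corrupt bytes flow straight into the merge and only surface later, e.g. as the "Negative value-length" error above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Conceptual model of per-chunk checksumming as done by Hadoop's
// ChecksumFileSystem: one CRC per fixed-size chunk, verified on read.
public class ChunkChecksum {
    static final int CHUNK = 512; // bytes per checksum, as in io.bytes.per.checksum

    // Compute one CRC32 per CHUNK-sized slice of the data.
    static List<Long> checksums(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // Verify data against stored checksums; return the index of the
    // first corrupt chunk, or -1 if everything matches.
    static int verify(byte[] data, List<Long> expected) {
        List<Long> actual = checksums(data);
        for (int i = 0; i < expected.size(); i++) {
            if (!actual.get(i).equals(expected.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

A single flipped bit anywhere in a chunk changes that chunk's CRC, so a working verifier pinpoints the corrupt region immediately; a verifier that never runs lets the bad bytes through silently, which is the failure mode suspected in HADOOP-4277.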