A little more on this.

So, I've narrowed the problem down to using Lzop compression
(com.hadoop.compression.lzo.LzopCodec)
for mapred.map.output.compression.codec:

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If I do the above, I get the Shuffle Error.
If I use DefaultCodec for mapred.map.output.compression.codec instead,
there is no problem.
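For reference, here are the two settings side by side. The LzoCodec class name below is an assumption based on the standard hadoop-lzo package, which ships both LzoCodec (a raw-stream codec) and LzopCodec (which writes the headered .lzo file format); the raw-stream codec is the one usually pointed at intermediate map output:

```xml
<!-- Works: the default (zlib-based) codec for intermediate map output -->
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

<!-- If LZO is wanted for intermediate output, the raw-stream LzoCodec
     (not the file-format LzopCodec) is the usual choice -->
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```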

Is this a known issue, or is it a bug?
It doesn't seem like this should be the expected behavior.

I would be glad to contribute any further info on this if necessary.
Please let me know.

Thanks

On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]> wrote:
> Hi, No problems. Thanks so much for your time. Greatly appreciated.
>
> I agree that it must be a configuration problem and so today I was able
> to start from scratch and did a fresh install of 0.20.2 on the 5 node cluster.
>
> I've now noticed that the error occurs when compression is enabled.
> I've run the basic wordcount example like so:
> http://pastebin.com/wvDMZZT0
> and get the Shuffle Error.
>
> TT logs show this error:
> WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> header checksum: 225702cc (expected 0x2325)
> Full logs:
> http://pastebin.com/fVGjcGsW
>
> My mapred-site.xml:
> http://pastebin.com/mQgMrKQw
>
> If I remove the compression config settings, the wordcount works fine
> - no more Shuffle Error.
> So I imagine something is wrong with my compression settings.
> I'll continue looking into this to see what else I can find out.
>
> Thanks a million.
>
> On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
>> Hi,
>>
>> Sorry, I couldn't take a close look at the logs until now.
>> Unfortunately, I could not see any huge difference between the success
>> and failure case. Can you please check whether basic hostname-to-IP
>> address mappings are in place (if you have static resolution of
>> hostnames set up)? A web search suggests this is the most likely
>> cause users have faced for this problem. Also, do the disks have
>> enough space? It would also be great if you could upload your Hadoop
>> configuration information.
>>
>> I do think it is very likely that configuration is the actual problem
>> because it works in one case anyway.
>>
>> Thanks
>> Hemanth
>>
>> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> 
>> wrote:
>>> Hello,
>>> I have still had no luck with this over the past week,
>>> and I get the exact same problem on a completely different 5 node
>>> cluster.
>>> Is it worth opening a new issue in JIRA for this?
>>> Thanks
>>>
>>>
>>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> 
>>> wrote:
>>>> Hello,
>>>> Thanks so much for the reply.
>>>> See inline.
>>>>
>>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> 
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>>> I've been getting the following error when trying to run a very simple
>>>>>> MapReduce job.
>>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>>> Reduce phase.
>>>>>>
>>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>>
>>>>>> I am running a 5 node cluster and I believe I have all my settings 
>>>>>> correct:
>>>>>>
>>>>>> * ulimit -n 32768
>>>>>> * DNS/RDNS configured properly
>>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>>
>>>>>> The program is very simple - just counts a unique string in a log file.
>>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>>>
>>>>>> When I run, the job fails and I get the following output.
>>>>>> http://pastebin.com/AhW6StEb
>>>>>>
>>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>>> map function in code above).
>>>>>>
>>>>>> This runs fine and completes successfully:
>>>>>>            String str = val.toString();
>>>>>>
>>>>>> This causes error and fails:
>>>>>>            String str = val.toString().substring(0,10);
>>>>>>
>>>>>> Please let me know if you need any further information.
>>>>>> It would be greatly appreciated if anyone could shed some light on this 
>>>>>> problem.
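[An aside on the substring variant, independent of Hadoop and not taken from the thread: String.substring(0, 10) throws StringIndexOutOfBoundsException on any line shorter than 10 characters, so a bounds guard in the mapper is a cheap way to rule that out as a cause. A minimal plain-Java sketch, with keyOf as a hypothetical helper name:]

```java
public class SafeSubstring {
    // Truncate a log line to at most `len` characters instead of calling
    // substring(0, len) unconditionally, which throws
    // StringIndexOutOfBoundsException on shorter lines.
    static String keyOf(String line, int len) {
        return line.length() >= len ? line.substring(0, len) : line;
    }

    public static void main(String[] args) {
        System.out.println(keyOf("2010-06-24 18:41:00 some log line", 10)); // prints "2010-06-24"
        System.out.println(keyOf("short", 10)); // prints "short"
    }
}
```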
>>>>>
>>>>> It catches attention that changing the code to use a substring is
>>>>> causing a difference. Assuming it is consistent and not a red herring,
>>>>
>>>> Yes, this has been consistent over the last week. I was running
>>>> 0.20.1 first and then
>>>> upgraded to 0.20.2, but the results have been exactly the same.
>>>>
>>>>> can you look at the counters for the two jobs using the JobTracker web
>>>>> UI - things like map records, bytes etc and see if there is a
>>>>> noticeable difference ?
>>>>
>>>> Ok, so here is the first job using write.set(value.toString()); having
>>>> *no* errors:
>>>> http://pastebin.com/xvy0iGwL
>>>>
>>>> And here is the second job using
>>>> write.set(value.toString().substring(0, 10)); that fails:
>>>> http://pastebin.com/uGw6yNqv
>>>>
>>>> And here is yet another run where I used a longer, and therefore
>>>> unique, string
>>>> via write.set(value.toString().substring(0, 20)); this makes every
>>>> line unique, similar to the first job.
>>>> Still fails.
>>>> http://pastebin.com/GdQ1rp8i
>>>>
>>>>> Also, are the two programs being run against
>>>>> the exact same input data?
>>>>
>>>> Yes, exactly the same input: a single CSV file with 23K lines.
>>>> Using a shorter substring produces more duplicate keys and
>>>> therefore more combining/reducing, but going
>>>> by the above it seems to fail whether the substring key is entirely
>>>> unique (23,000 combine output records) or
>>>> mostly the same (9 combine output records).
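[To illustrate the cardinality point above with toy data, not the thread's actual CSV: truncating each line to a short prefix collapses many lines onto the same key, while a longer prefix keeps them distinct, which is what drives the combine-output-record counts. A small sketch, with distinctKeys as a hypothetical helper name:]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeyCardinality {
    // Count distinct keys produced by truncating each line to at most
    // `len` characters, mirroring write.set(value.toString().substring(0, len)).
    static int distinctKeys(List<String> lines, int len) {
        Set<String> keys = new HashSet<>();
        for (String line : lines) {
            keys.add(line.length() >= len ? line.substring(0, len) : line);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2010-06-24,alpha,1", "2010-06-24,beta,2", "2010-06-25,gamma,3");
        System.out.println(distinctKeys(lines, 10)); // date prefix only: prints 2
        System.out.println(distinctKeys(lines, 18)); // near-whole lines: prints 3
    }
}
```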
>>>>
>>>>>
>>>>> Also, since the cluster size is small, you could also look at the
>>>>> tasktracker logs on the machines where the maps have run to see if
>>>>> there are any failures when the reduce attempts start failing.
>>>>
>>>> Here is the TT log from the last failed job. I do not see anything
>>>> besides the shuffle failure, but there
>>>> may be something I am overlooking or simply do not understand.
>>>> http://pastebin.com/DKFTyGXg
>>>>
>>>> Thanks again!
>>>>
>>>>>
>>>>> Thanks
>>>>> Hemanth
>>>>>
>>>>
>>>
>>
>
