Hi,

No problem, and thanks so much for your time; it's greatly appreciated. I
agree that it must be a configuration problem, so today I started from
scratch and did a fresh install of 0.20.2 on the 5-node cluster.
I've now noticed that the error occurs when compression is enabled. I ran
the basic wordcount example like so: http://pastebin.com/wvDMZZT0 and got
the Shuffle Error. The TT logs show this error:

WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
header checksum: 225702cc (expected 0x2325)

Full logs: http://pastebin.com/fVGjcGsW
My mapred-site.xml: http://pastebin.com/mQgMrKQw

If I remove the compression config settings, the wordcount works fine - no
more Shuffle Error. So I imagine something is wrong with my compression
settings. I'll keep looking into this to see what else I can find out.
Thanks a million.

On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
> Hi,
>
> Sorry, I couldn't take a close look at the logs until now.
> Unfortunately, I could not see any big difference between the success
> and failure cases. Can you please check that basics like hostname-to-IP
> mapping are in place (if you have static resolution of hostnames set
> up)? A web search suggests this is the most common cause of this
> problem. Also, do the disks have enough space? It would also be great
> if you could upload your Hadoop configuration.
>
> I do think it is very likely that configuration is the actual problem,
> because it works in one case anyway.
>
> Thanks
> Hemanth
>
> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
>> Hello,
>> I still have had no luck with this over the past week, and I even get
>> exactly the same problem on a completely different 5-node cluster.
>> Is it worth opening a new issue in JIRA for this?
>> Thanks
>>
>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
>>> Hello,
>>> Thanks so much for the reply. See inline.
>>>
>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>>>> Hi,
>>>>
>>>>> I've been getting the following error when trying to run a very
>>>>> simple MapReduce job. Map finishes without problem, but the error
>>>>> occurs as soon as it enters the Reduce phase.
>>>>>
>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>
>>>>> I am running a 5-node cluster and I believe I have all my settings
>>>>> correct:
>>>>>
>>>>> * ulimit -n 32768
>>>>> * DNS/RDNS configured properly
>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>
>>>>> The program is very simple - it just counts a unique string in a
>>>>> log file. See here: http://pastebin.com/5uRG3SFL
>>>>>
>>>>> When I run it, the job fails and I get the following output:
>>>>> http://pastebin.com/AhW6StEb
>>>>>
>>>>> However, it runs fine when I do *not* use substring() on the value
>>>>> (see the map function in the code above).
>>>>>
>>>>> This runs fine and completes successfully:
>>>>> String str = val.toString();
>>>>>
>>>>> This causes the error and fails:
>>>>> String str = val.toString().substring(0,10);
>>>>>
>>>>> Please let me know if you need any further information. It would be
>>>>> greatly appreciated if anyone could shed some light on this problem.
>>>>
>>>> It catches the eye that changing the code to use a substring makes a
>>>> difference. Assuming it is consistent and not a red herring,
>>>
>>> Yes, this has been consistent over the last week.
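>>> To be concrete, the two runs differ by a single line in the map
>>> function. Here is a minimal sketch of the mapper (the class skeleton
>>> and names are reconstructed for illustration; the actual code is in
>>> the pastebin above):
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.IntWritable;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>>
>>> public class CountMapper
>>>     extends Mapper<LongWritable, Text, Text, IntWritable> {
>>>
>>>   private final static IntWritable one = new IntWritable(1);
>>>   private final Text word = new Text();
>>>
>>>   @Override
>>>   public void map(LongWritable key, Text val, Context context)
>>>       throws IOException, InterruptedException {
>>>     // This variant completes successfully:
>>>     // String str = val.toString();
>>>     // This variant triggers the Shuffle Error:
>>>     String str = val.toString().substring(0, 10);
>>>     word.set(str);
>>>     context.write(word, one);
>>>   }
>>> }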
>>> I was running 0.20.1 first and then upgraded to 0.20.2, but the
>>> results have been exactly the same.
>>>
>>>> Can you look at the counters for the two jobs using the JobTracker
>>>> web UI - things like map records, bytes etc. - and see if there is
>>>> a noticeable difference?
>>>
>>> Ok, so here is the first job, using write.set(value.toString());,
>>> which has *no* errors:
>>> http://pastebin.com/xvy0iGwL
>>>
>>> And here is the second job, using
>>> write.set(value.toString().substring(0, 10));, which fails:
>>> http://pastebin.com/uGw6yNqv
>>>
>>> And here is yet another, where I used a longer, and therefore
>>> unique, string via write.set(value.toString().substring(0, 20));
>>> This makes every line unique, similar to the first job. It still
>>> fails:
>>> http://pastebin.com/GdQ1rp8i
>>>
>>>> Also, are the two programs being run against the exact same input
>>>> data?
>>>
>>> Yes, exactly the same input: a single CSV file with 23K lines. Using
>>> a shorter substring leads to more duplicate keys and therefore more
>>> combining/reducing, but going by the above it seems to fail whether
>>> the substring/key is entirely unique (23,000 combine output records)
>>> or mostly the same (9 combine output records).
>>>
>>>> Also, since the cluster size is small, you could look at the
>>>> tasktracker logs on the machines where the maps have run to see if
>>>> there are any failures when the reduce attempts start failing.
>>>
>>> Here is the TT log from the last failed job. I do not see anything
>>> besides the shuffle failure, but there may be something I am
>>> overlooking or simply do not understand:
>>> http://pastebin.com/DKFTyGXg
>>>
>>> Thanks again!
>>>
>>>> Thanks
>>>> Hemanth
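P.S. For reference, the compression settings I removed from
mapred-site.xml to get wordcount passing are of this general shape (a
sketch only - the codec value below is illustrative and not necessarily
the one in my pastebin'd config):

  <!-- Compress intermediate map output before the shuffle. Removing
       these two properties made the Shuffle Error go away. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>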
