Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

bmdevelopment Mon, 05 Jul 2010 00:13:04 -0700

Hello,
I have still had no luck with this over the past week.
I even get exactly the same problem on a completely different 5-node cluster.
Is it worth opening a new issue in JIRA for this?
Thanks



On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>> Hi,
>>
>>> I've been getting the following error when trying to run a very simple
>>> MapReduce job.
>>> Map finishes without a problem, but the error occurs as soon as the job
>>> enters the Reduce phase.
>>>
>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>
>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>
>>> * ulimit -n 32768
>>> * DNS/RDNS configured properly
>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>
>>> The program is very simple - just counts a unique string in a log file.
>>> See here: http://pastebin.com/5uRG3SFL
>>>
>>> When I run, the job fails and I get the following output.
>>> http://pastebin.com/AhW6StEb
>>>
>>> However, it runs fine when I do *not* use substring() on the value (see
>>> the map function in the code above).
>>>
>>> This runs fine and completes successfully:
>>>            String str = val.toString();
>>>
>>> This causes the error and fails:
>>>            String str = val.toString().substring(0,10);
>>>
>>> Please let me know if you need any further information.
>>> It would be greatly appreciated if anyone could shed some light on this 
>>> problem.
>>
>> It is notable that changing the code to use a substring makes a
>> difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1
> first and then upgraded to 0.20.2, but the results have been exactly the same.
>
>> can you look at the counters for the two jobs using the JobTracker web
>> UI - things like map records, bytes etc and see if there is a
>> noticeable difference ?
>
> Ok, so here is the first job using write.set(value.toString()); having
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job using
> write.set(value.toString().substring(0, 10)); that fails:
> http://pastebin.com/uGw6yNqv
>
> And here is yet another job, where I used a longer and therefore unique string
> via write.set(value.toString().substring(0, 20)); this makes every line
> unique, similar to the first job.
> It still fails.
> http://pastebin.com/GdQ1rp8i
>
>> Also, are the two programs being run against
>> the exact same input data ?
>
> Yes, exactly the same input: a single csv file with 23K lines.
> Using a shorter string leads to more identical keys and therefore more
> combining/reducing, but going by the above it seems to fail whether the
> substring/key is entirely unique (23000 combine output records) or
> mostly the same (9 combine output records).
>
>>
>> Also, since the cluster size is small, you could also look at the
>> tasktracker logs on the machines where the maps have run to see if
>> there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there
> may be something I am overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
>>
>> Thanks
>> Hemanth
>>
>
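
For anyone following along, here is a rough, standalone sketch (plain Java, no Hadoop) of how the substring length used for the key changes the number of distinct keys, as described in the quoted message. The class name, the method names, and the sample lines are all hypothetical; this is not the code from the pastebin links. The guard in safePrefix() is an extra assumption on my part: substring(0, n) throws StringIndexOutOfBoundsException on a line shorter than n characters, though nothing in the thread shows that this happens here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class KeyPrefixDemo {
    // Returns at most the first n characters of the line.
    // Unlike line.substring(0, n), this does not throw on lines shorter than n.
    static String safePrefix(String line, int n) {
        return line.length() <= n ? line : line.substring(0, n);
    }

    // Counts the distinct keys a map step would emit for a given prefix length.
    static int distinctKeys(List<String> lines, int n) {
        Set<String> keys = new HashSet<String>();
        for (String line : lines) {
            keys.add(safePrefix(line, n));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        // Made-up sample lines; the real input is a single CSV file with 23K lines.
        List<String> lines = Arrays.asList(
            "2010-06-24,18:41:00,alpha",
            "2010-06-24,18:41:05,beta",
            "2010-06-25,09:00:00,gamma",
            "short");
        System.out.println("prefix 10: " + distinctKeys(lines, 10) + " distinct keys");
        System.out.println("prefix 20: " + distinctKeys(lines, 20) + " distinct keys");
    }
}
```

A shorter prefix collapses more lines onto the same key, which is why the combine output record counts differ so much between the jobs, even though the shuffle fails in both substring cases.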
