Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Todd Lipcon Thu, 08 Jul 2010 11:24:33 -0700

On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:

> Todd fixed a bug where LZO header or block header data may fall on read
> boundary:
>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>
> I am wondering if that is related to the issue you saw.
>

I don't think this bug would show up in intermediate output compression,
but it's certainly possible. There have been a number of bugs fixed in LZO
over on github - are you using the github version or the one from Google
Code which is out of date? Either mine or Kevin's repo on github should be a
good version (I think we called the newest 0.3.4)

-Todd


>
> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[email protected]> wrote:
>
>> A little more on this.
>>
>> So, I've narrowed down the problem to using Lzop compression
>> (com.hadoop.compression.lzo.LzopCodec)
>> for mapred.map.output.compression.codec.
>>
>> <property>
>>    <name>mapred.map.output.compression.codec</name>
>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> </property>
>>
>> If I do the above, I will get the Shuffle Error.
>> If I use DefaultCodec for mapred.map.output.compression.codec,
>> there is no problem.
>>
>> Is this a known issue? Or is this a bug?
>> Doesn't seem like it should be the expected behavior.
>>
>> I would be glad to contribute any further info on this if necessary.
>> Please let me know.
>>
>> Thanks
>>
>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]> wrote:
>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >
>> > I agree that it must be a configuration problem and so today I was able
>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>> > cluster.
>> >
>> > I've now noticed that the error occurs when compression is enabled.
>> > I've run the basic wordcount example as so:
>> > http://pastebin.com/wvDMZZT0
>> > and get the Shuffle Error.
>> >
>> > TT logs show this error:
>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>> > header checksum: 225702cc (expected 0x2325)
>> > Full logs:
>> > http://pastebin.com/fVGjcGsW
>> >
>> > My mapred-site.xml:
>> > http://pastebin.com/mQgMrKQw
>> >
>> > If I remove the compression config settings, the wordcount works fine
>> > - no more Shuffle Error.
>> > So, I have something wrong with my compression settings I imagine.
>> > I'll continue looking into this to see what else I can find out.
>> >
>> > Thanks a million.
>> >
>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
>> >> Hi,
>> >>
>> >> Sorry, I couldn't take a close look at the logs until now.
>> >> Unfortunately, I could not see any huge difference between the success
>> >> and failure case. Can you please check if things like basic hostname -
>> >> ip address mapping are in place (if you have static resolution of
>> >> hostnames set up) ? A web search is giving this as the most likely
>> >> cause users have faced regarding this problem. Also do the disks have
>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >> configuration information.
>> >>
>> >> I do think it is very likely that configuration is the actual problem
>> >> because it works in one case anyway.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
>> >>> Hello,
>> >>> I still have had no luck with this over the past week.
>> >>> And even get the same exact problem on a completely different 5 node
>> >>> cluster.
>> >>> Is it worth opening a new issue in jira for this?
>> >>> Thanks
>> >>>
>> >>>
>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
>> >>>> Hello,
>> >>>> Thanks so much for the reply.
>> >>>> See inline.
>> >>>>
>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>>> I've been getting the following error when trying to run a very
>> >>>>>> simple MapReduce job.
>> >>>>>> Map finishes without problem, but error occurs as soon as it enters
>> >>>>>> Reduce phase.
>> >>>>>>
>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>>>>>
>> >>>>>> I am running a 5 node cluster and I believe I have all my settings
>> >>>>>> correct:
>> >>>>>>
>> >>>>>> * ulimit -n 32768
>> >>>>>> * DNS/RDNS configured properly
>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>>>>>
>> >>>>>> The program is very simple - just counts a unique string in a log
>> >>>>>> file.
>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>>>>>
>> >>>>>> When I run, the job fails and I get the following output.
>> >>>>>> http://pastebin.com/AhW6StEb
>> >>>>>>
>> >>>>>> However, it runs fine when I do *not* use substring() on the value
>> >>>>>> (see map function in code above).
>> >>>>>>
>> >>>>>> This runs fine and completes successfully:
>> >>>>>>            String str = val.toString();
>> >>>>>>
>> >>>>>> This causes error and fails:
>> >>>>>>            String str = val.toString().substring(0,10);
>> >>>>>>
>> >>>>>> Please let me know if you need any further information.
>> >>>>>> It would be greatly appreciated if anyone could shed some light on
>> >>>>>> this problem.
>> >>>>>
>> >>>>> It catches attention that changing the code to use a substring is
>> >>>>> causing a difference. Assuming it is consistent and not a red
>> >>>>> herring,
>> >>>>
>> >>>> Yes, this has been consistent over the last week. I was running
>> >>>> 0.20.1 first and then upgraded to 0.20.2, but results have been
>> >>>> exactly the same.
>> >>>>
>> >>>>> can you look at the counters for the two jobs using the JobTracker
>> >>>>> web UI - things like map records, bytes etc and see if there is a
>> >>>>> noticeable difference ?
>> >>>>
>> >>>> Ok, so here is the first job using write.set(value.toString());
>> >>>> having *no* errors:
>> >>>> http://pastebin.com/xvy0iGwL
>> >>>>
>> >>>> And here is the second job using
>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>>> http://pastebin.com/uGw6yNqv
>> >>>>
>> >>>> And here is yet another where I used a longer, and therefore unique,
>> >>>> string, by write.set(value.toString().substring(0, 20)); This makes
>> >>>> every line unique, similar to the first job.
>> >>>> Still fails.
>> >>>> http://pastebin.com/GdQ1rp8i
>> >>>>
>> >>>>> Also, are the two programs being run against
>> >>>>> the exact same input data ?
>> >>>>
>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>>> combining/reducing, but going
>> >>>> by the above it seems to fail whether the substring/key is entirely
>> >>>> unique (23000 combine output records) or
>> >>>> mostly the same (9 combine output records).
>> >>>>
>> >>>>>
>> >>>>> Also, since the cluster size is small, you could also look at the
>> >>>>> tasktracker logs on the machines where the maps have run to see if
>> >>>>> there are any failures when the reduce attempts start failing.
>> >>>>
>> >>>> Here is the TT log from the last failed job. I do not see anything
>> >>>> besides the shuffle failure, but there
>> >>>> may be something I am overlooking or simply do not understand.
>> >>>> http://pastebin.com/DKFTyGXg
>> >>>>
>> >>>> Thanks again!
>> >>>>
>> >>>>>
>> >>>>> Thanks
>> >>>>> Hemanth
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
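
The thread narrows the failure down to one setting: the Shuffle Error
appears when mapred.map.output.compression.codec is LzopCodec and goes away
with DefaultCodec. As a rough aid for checking which codec a given
mapred-site.xml selects, here is a minimal Python sketch. It assumes the
usual Hadoop <configuration>/<property>/<name>/<value> layout; the helper
names (read_properties, uses_lzop_for_map_output) and the SAMPLE text are
hypothetical and only for illustration. It reads the config text and does
not talk to a cluster.

```python
import xml.etree.ElementTree as ET

# Property name and codec class exactly as quoted in the thread.
CODEC_KEY = "mapred.map.output.compression.codec"
LZOP_CODEC = "com.hadoop.compression.lzo.LzopCodec"


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def uses_lzop_for_map_output(xml_text):
    """True if the map output codec is set to LzopCodec (hypothetical helper)."""
    return read_properties(xml_text).get(CODEC_KEY) == LZOP_CODEC


# Illustrative config text mirroring the property shown in the thread.
SAMPLE = """<configuration>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(uses_lzop_for_map_output(SAMPLE))
```

If the check reports LzopCodec, the two options discussed above are to
switch the property to DefaultCodec as a workaround, or to move to a newer
hadoop-lzo build from github.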
