I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically mention this potential issue so that other people can avoid the same problem. Feel free to add more onto it.
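For anyone finding this thread in the archives, the gist as I understand it: LzopCodec writes the full .lzo file format (header plus checksums, which would explain the "Invalid header checksum" messages below), so for intermediate map output you generally want the raw-stream LzoCodec instead. A minimal sketch of the relevant mapred-site.xml properties (0.20.x property names; adjust for your setup):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

LzopCodec remains the right choice for final job output files that you want to index and split.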
On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <[email protected]> wrote:
> Thanks everyone.
>
> Yes, using the Google Code version referenced on the wiki:
> http://wiki.apache.org/hadoop/UsingLzoCompression
>
> I will try the latest version and see if that fixes the problem.
> http://github.com/kevinweil/hadoop-lzo
>
> Thanks
>
> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <[email protected]> wrote:
> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:
> >>
> >> Todd fixed a bug where LZO header or block header data may fall on a
> >> read boundary:
> >>
> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
> >>
> >> I am wondering if that is related to the issue you saw.
> >
> > I don't think this bug would show up in intermediate output compression,
> > but it's certainly possible. There have been a number of bugs fixed in
> > LZO over on github - are you using the github version, or the one from
> > Google Code, which is out of date? Either mine or Kevin's repo on github
> > should be a good version (I think we called the newest 0.3.4).
> > -Todd
> >
> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[email protected]>
> >> wrote:
> >>>
> >>> A little more on this.
> >>>
> >>> So, I've narrowed down the problem to using Lzop compression
> >>> (com.hadoop.compression.lzo.LzopCodec)
> >>> for mapred.map.output.compression.codec:
> >>>
> >>> <property>
> >>>   <name>mapred.map.output.compression.codec</name>
> >>>   <value>com.hadoop.compression.lzo.LzopCodec</value>
> >>> </property>
> >>>
> >>> With the above, I get the Shuffle Error.
> >>> If I use DefaultCodec for mapred.map.output.compression.codec,
> >>> there is no problem.
> >>>
> >>> Is this a known issue, or is this a bug?
> >>> It doesn't seem like it should be the expected behavior.
> >>>
> >>> I would be glad to contribute any further info on this if necessary.
> >>> Please let me know.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]>
> >>> wrote:
> >>> > Hi, no problem. Thanks so much for your time. Greatly appreciated.
> >>> >
> >>> > I agree that it must be a configuration problem, so today I started
> >>> > from scratch and did a fresh install of 0.20.2 on the 5-node
> >>> > cluster.
> >>> >
> >>> > I've now noticed that the error occurs when compression is enabled.
> >>> > I've run the basic wordcount example like so:
> >>> > http://pastebin.com/wvDMZZT0
> >>> > and get the Shuffle Error.
> >>> >
> >>> > The TT logs show this error:
> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
> >>> > Invalid header checksum: 225702cc (expected 0x2325)
> >>> > Full logs:
> >>> > http://pastebin.com/fVGjcGsW
> >>> >
> >>> > My mapred-site.xml:
> >>> > http://pastebin.com/mQgMrKQw
> >>> >
> >>> > If I remove the compression config settings, the wordcount works
> >>> > fine - no more Shuffle Error.
> >>> > So I imagine something is wrong with my compression settings.
> >>> > I'll continue looking into this to see what else I can find out.
> >>> >
> >>> > Thanks a million.
> >>> >
> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]>
> >>> > wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Sorry, I couldn't take a close look at the logs until now.
> >>> >> Unfortunately, I could not see any huge difference between the
> >>> >> success and failure case. Can you please check that things like
> >>> >> basic hostname-to-IP-address mapping are in place (if you have
> >>> >> static resolution of hostnames set up)? A web search suggests this
> >>> >> is the most likely cause users have faced for this problem. Also,
> >>> >> do the disks have enough space? And it would be great if you could
> >>> >> upload your hadoop configuration information.
> >>> >>
> >>> >> I do think it is very likely that configuration is the actual
> >>> >> problem, because it works in one case anyway.
> >>> >>
> >>> >> Thanks
> >>> >> Hemanth
> >>> >>
> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
> >>> >> <[email protected]> wrote:
> >>> >>> Hello,
> >>> >>> I still have had no luck with this over the past week,
> >>> >>> and I even get the exact same problem on a completely different
> >>> >>> 5-node cluster.
> >>> >>> Is it worth opening a new issue in JIRA for this?
> >>> >>> Thanks
> >>> >>>
> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
> >>> >>> <[email protected]> wrote:
> >>> >>>> Hello,
> >>> >>>> Thanks so much for the reply.
> >>> >>>> See inline.
> >>> >>>>
> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
> >>> >>>> <[email protected]> wrote:
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>>> I've been getting the following error when trying to run a
> >>> >>>>>> very simple MapReduce job.
> >>> >>>>>> Map finishes without problem, but the error occurs as soon as
> >>> >>>>>> it enters the Reduce phase.
> >>> >>>>>>
> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>> >>>>>>
> >>> >>>>>> I am running a 5-node cluster and I believe I have all my
> >>> >>>>>> settings correct:
> >>> >>>>>>
> >>> >>>>>> * ulimit -n 32768
> >>> >>>>>> * DNS/RDNS configured properly
> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>> >>>>>>
> >>> >>>>>> The program is very simple - it just counts a unique string in
> >>> >>>>>> a log file.
> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>> >>>>>>
> >>> >>>>>> When I run it, the job fails and I get the following output:
> >>> >>>>>> http://pastebin.com/AhW6StEb
> >>> >>>>>>
> >>> >>>>>> However, it runs fine when I do *not* use substring() on the
> >>> >>>>>> value (see the map function in the code above).
> >>> >>>>>>
> >>> >>>>>> This runs fine and completes successfully:
> >>> >>>>>> String str = val.toString();
> >>> >>>>>>
> >>> >>>>>> This causes the error and fails:
> >>> >>>>>> String str = val.toString().substring(0,10);
> >>> >>>>>>
> >>> >>>>>> Please let me know if you need any further information.
> >>> >>>>>> It would be greatly appreciated if anyone could shed some
> >>> >>>>>> light on this problem.
> >>> >>>>>
> >>> >>>>> It is striking that changing the code to use a substring is
> >>> >>>>> causing a difference. Assuming it is consistent and not a red
> >>> >>>>> herring,
> >>> >>>>
> >>> >>>> Yes, this has been consistent over the last week. I was running
> >>> >>>> 0.20.1 first and then
> >>> >>>> upgraded to 0.20.2, but the results have been exactly the same.
> >>> >>>>
> >>> >>>>> can you look at the counters for the two jobs using the
> >>> >>>>> JobTracker web UI - things like map records, bytes etc. - and
> >>> >>>>> see if there is a noticeable difference?
> >>> >>>>
> >>> >>>> Ok, so here is the first job, using write.set(value.toString());,
> >>> >>>> having *no* errors:
> >>> >>>> http://pastebin.com/xvy0iGwL
> >>> >>>>
> >>> >>>> And here is the second job, using
> >>> >>>> write.set(value.toString().substring(0, 10));, that fails:
> >>> >>>> http://pastebin.com/uGw6yNqv
> >>> >>>>
> >>> >>>> And here is yet another where I used a longer, and therefore
> >>> >>>> unique, string,
> >>> >>>> via write.set(value.toString().substring(0, 20));. This makes
> >>> >>>> every line unique, similar to the first job.
> >>> >>>> Still fails.
> >>> >>>> http://pastebin.com/GdQ1rp8i
> >>> >>>>
> >>> >>>>> Also, are the two programs being run against
> >>> >>>>> the exact same input data?
> >>> >>>>
> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>> >>>> Using a shorter string leads to more identical keys, and
> >>> >>>> therefore more combining/reducing, but going by the above it
> >>> >>>> seems to fail whether the substring/key is entirely unique
> >>> >>>> (23000 combine output records) or mostly the same (9 combine
> >>> >>>> output records).
> >>> >>>>
> >>> >>>>> Also, since the cluster size is small, you could also look at
> >>> >>>>> the tasktracker logs on the machines where the maps have run to
> >>> >>>>> see if there are any failures when the reduce attempts start
> >>> >>>>> failing.
> >>> >>>>
> >>> >>>> Here is the TT log from the last failed job. I do not see
> >>> >>>> anything besides the shuffle failure, but there
> >>> >>>> may be something I am overlooking or simply do not understand.
> >>> >>>> http://pastebin.com/DKFTyGXg
> >>> >>>>
> >>> >>>> Thanks again!
> >>> >>>>
> >>> >>>>> Thanks
> >>> >>>>> Hemanth
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
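P.S. If you configure jobs programmatically rather than through
mapred-site.xml, the equivalent on the old 0.20 JobConf API would look
something like the sketch below (untested; WordCount is just a placeholder
for your own job class):

  import org.apache.hadoop.mapred.JobConf;
  import com.hadoop.compression.lzo.LzoCodec;

  JobConf conf = new JobConf(WordCount.class);
  // Compress intermediate map output with the raw LZO stream codec,
  // not the lzop file-format codec.
  conf.setCompressMapOutput(true);
  conf.setMapOutputCompressorClass(LzoCodec.class);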
