I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically mention this potential issue so that other people can avoid the same problem. Feel free to add more onto it.
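For anyone finding this thread in the archives, the gist as I understand it: LzopCodec writes the full .lzo file format (header plus checksums, which would explain the "Invalid header checksum" messages below), so for intermediate map output you generally want the raw-stream LzoCodec instead. A minimal sketch of the relevant mapred-site.xml properties (0.20.x property names; adjust for your setup):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

LzopCodec remains the right choice for final job output files that you want to index and split.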
On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <[email protected]> wrote:
> Thanks everyone.
>
> Yes, using the Google Code version referenced on the wiki:
> http://wiki.apache.org/hadoop/UsingLzoCompression
>
> I will try the latest version and see if that fixes the problem.
> http://github.com/kevinweil/hadoop-lzo
>
> Thanks
>
> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <[email protected]> wrote:
> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:
> >>
> >> Todd fixed a bug where LZO header or block header data may fall on a
> >> read boundary:
> >>
> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
> >>
> >> I am wondering if that is related to the issue you saw.
> >
> > I don't think this bug would show up in intermediate output compression,
> > but it's certainly possible. There have been a number of bugs fixed in
> > LZO over on github - are you using the github version, or the one from
> > Google Code, which is out of date? Either mine or Kevin's repo on github
> > should be a good version (I think we called the newest 0.3.4).
> > -Todd
> >
> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[email protected]>
> >> wrote:
> >>>
> >>> A little more on this.
> >>>
> >>> So, I've narrowed down the problem to using Lzop compression
> >>> (com.hadoop.compression.lzo.LzopCodec)
> >>> for mapred.map.output.compression.codec:
> >>>
> >>> <property>
> >>>   <name>mapred.map.output.compression.codec</name>
> >>>   <value>com.hadoop.compression.lzo.LzopCodec</value>
> >>> </property>
> >>>
> >>> With the above, I get the Shuffle Error.
> >>> If I use DefaultCodec for mapred.map.output.compression.codec,
> >>> there is no problem.
> >>>
> >>> Is this a known issue, or is this a bug?
> >>> It doesn't seem like it should be the expected behavior.
> >>>
> >>> I would be glad to contribute any further info on this if necessary.
> >>> Please let me know.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]>
> >>> wrote:
> >>> > Hi, no problem. Thanks so much for your time. Greatly appreciated.
> >>> >
> >>> > I agree that it must be a configuration problem, so today I started
> >>> > from scratch and did a fresh install of 0.20.2 on the 5-node
> >>> > cluster.
> >>> >
> >>> > I've now noticed that the error occurs when compression is enabled.
> >>> > I've run the basic wordcount example like so:
> >>> > http://pastebin.com/wvDMZZT0
> >>> > and get the Shuffle Error.
> >>> >
> >>> > The TT logs show this error:
> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
> >>> > Invalid header checksum: 225702cc (expected 0x2325)
> >>> > Full logs:
> >>> > http://pastebin.com/fVGjcGsW
> >>> >
> >>> > My mapred-site.xml:
> >>> > http://pastebin.com/mQgMrKQw
> >>> >
> >>> > If I remove the compression config settings, the wordcount works
> >>> > fine - no more Shuffle Error.
> >>> > So I imagine something is wrong with my compression settings.
> >>> > I'll continue looking into this to see what else I can find out.
> >>> >
> >>> > Thanks a million.
> >>> >
> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]>
> >>> > wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Sorry, I couldn't take a close look at the logs until now.
> >>> >> Unfortunately, I could not see any huge difference between the
> >>> >> success and failure case. Can you please check that things like
> >>> >> basic hostname-to-IP-address mapping are in place (if you have
> >>> >> static resolution of hostnames set up)? A web search suggests this
> >>> >> is the most likely cause users have faced for this problem. Also,
> >>> >> do the disks have enough space? And it would be great if you could
> >>> >> upload your hadoop configuration information.
> >>> >>
> >>> >> I do think it is very likely that configuration is the actual
> >>> >> problem, because it works in one case anyway.
> >>> >>
> >>> >> Thanks
> >>> >> Hemanth
> >>> >>
> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
> >>> >> <[email protected]> wrote:
> >>> >>> Hello,
> >>> >>> I still have had no luck with this over the past week,
> >>> >>> and I even get the exact same problem on a completely different
> >>> >>> 5-node cluster.
> >>> >>> Is it worth opening a new issue in JIRA for this?
> >>> >>> Thanks
> >>> >>>
> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
> >>> >>> <[email protected]> wrote:
> >>> >>>> Hello,
> >>> >>>> Thanks so much for the reply.
> >>> >>>> See inline.
> >>> >>>>
> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
> >>> >>>> <[email protected]> wrote:
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>>> I've been getting the following error when trying to run a
> >>> >>>>>> very simple MapReduce job.
> >>> >>>>>> Map finishes without problem, but the error occurs as soon as
> >>> >>>>>> it enters the Reduce phase.
> >>> >>>>>>
> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>> >>>>>>
> >>> >>>>>> I am running a 5-node cluster and I believe I have all my
> >>> >>>>>> settings correct:
> >>> >>>>>>
> >>> >>>>>> * ulimit -n 32768
> >>> >>>>>> * DNS/RDNS configured properly
> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>> >>>>>>
> >>> >>>>>> The program is very simple - it just counts a unique string in
> >>> >>>>>> a log file.
> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>> >>>>>>
> >>> >>>>>> When I run it, the job fails and I get the following output:
> >>> >>>>>> http://pastebin.com/AhW6StEb
> >>> >>>>>>
> >>> >>>>>> However, it runs fine when I do *not* use substring() on the
> >>> >>>>>> value (see the map function in the code above).
> >>> >>>>>>
> >>> >>>>>> This runs fine and completes successfully:
> >>> >>>>>> String str = val.toString();
> >>> >>>>>>
> >>> >>>>>> This causes the error and fails:
> >>> >>>>>> String str = val.toString().substring(0,10);
> >>> >>>>>>
> >>> >>>>>> Please let me know if you need any further information.
> >>> >>>>>> It would be greatly appreciated if anyone could shed some
> >>> >>>>>> light on this problem.
> >>> >>>>>
> >>> >>>>> It is striking that changing the code to use a substring is
> >>> >>>>> causing a difference. Assuming it is consistent and not a red
> >>> >>>>> herring,
> >>> >>>>
> >>> >>>> Yes, this has been consistent over the last week. I was running
> >>> >>>> 0.20.1 first and then
> >>> >>>> upgraded to 0.20.2, but the results have been exactly the same.
> >>> >>>>
> >>> >>>>> can you look at the counters for the two jobs using the
> >>> >>>>> JobTracker web UI - things like map records, bytes etc. - and
> >>> >>>>> see if there is a noticeable difference?
> >>> >>>>
> >>> >>>> Ok, so here is the first job, using write.set(value.toString());,
> >>> >>>> having *no* errors:
> >>> >>>> http://pastebin.com/xvy0iGwL
> >>> >>>>
> >>> >>>> And here is the second job, using
> >>> >>>> write.set(value.toString().substring(0, 10));, that fails:
> >>> >>>> http://pastebin.com/uGw6yNqv
> >>> >>>>
> >>> >>>> And here is yet another where I used a longer, and therefore
> >>> >>>> unique, string,
> >>> >>>> via write.set(value.toString().substring(0, 20));. This makes
> >>> >>>> every line unique, similar to the first job.
> >>> >>>> Still fails.
> >>> >>>> http://pastebin.com/GdQ1rp8i
> >>> >>>>
> >>> >>>>> Also, are the two programs being run against
> >>> >>>>> the exact same input data?
> >>> >>>>
> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>> >>>> Using a shorter string leads to more identical keys, and
> >>> >>>> therefore more combining/reducing, but going by the above it
> >>> >>>> seems to fail whether the substring/key is entirely unique
> >>> >>>> (23000 combine output records) or mostly the same (9 combine
> >>> >>>> output records).
> >>> >>>>
> >>> >>>>> Also, since the cluster size is small, you could also look at
> >>> >>>>> the tasktracker logs on the machines where the maps have run to
> >>> >>>>> see if there are any failures when the reduce attempts start
> >>> >>>>> failing.
> >>> >>>>
> >>> >>>> Here is the TT log from the last failed job. I do not see
> >>> >>>> anything besides the shuffle failure, but there
> >>> >>>> may be something I am overlooking or simply do not understand.
> >>> >>>> http://pastebin.com/DKFTyGXg
> >>> >>>>
> >>> >>>> Thanks again!
> >>> >>>>
> >>> >>>>> Thanks
> >>> >>>>> Hemanth
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
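P.S. If you configure jobs programmatically rather than through
mapred-site.xml, the equivalent on the old 0.20 JobConf API would look
something like the sketch below (untested; WordCount is just a placeholder
for your own job class):

  import org.apache.hadoop.mapred.JobConf;
  import com.hadoop.compression.lzo.LzoCodec;

  JobConf conf = new JobConf(WordCount.class);
  // Compress intermediate map output with the raw LZO stream codec,
  // not the lzop file-format codec.
  conf.setCompressMapOutput(true);
  conf.setMapOutputCompressorClass(LzoCodec.class);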
