Todd fixed a bug where the LZO file header or a block header may fall on a read boundary: http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
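For context, here is a minimal sketch of the general failure mode that
commit addresses (an illustration of the pattern only, not the actual
hadoop-lzo code): a single InputStream.read() may return fewer bytes
than requested, so a decompressor that reads a header with one read()
call can be left holding a partial header whenever the header straddles
a read boundary - which would explain a bogus "Invalid header checksum".
The usual fix is to loop until the requested bytes have fully arrived:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    final class ReadUtil {
      // Keep reading until 'len' bytes have arrived or the stream ends.
      // A plain in.read(buf, off, len) is allowed to return fewer bytes
      // than requested, e.g. when the header spans two network packets.
      static void readFully(InputStream in, byte[] buf, int off, int len)
          throws IOException {
        while (len > 0) {
          int n = in.read(buf, off, len);
          if (n < 0) {
            throw new EOFException("stream ended mid-header");
          }
          off += n;
          len -= n;
        }
      }
    }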
I am wondering if that is related to the issue you saw.

On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[email protected]> wrote:
> A little more on this.
>
> So, I've narrowed down the problem to using Lzop compression
> (com.hadoop.compression.lzo.LzopCodec)
> for mapred.map.output.compression.codec:
>
> <property>
> <name>mapred.map.output.compression.codec</name>
> <value>com.hadoop.compression.lzo.LzopCodec</value>
> </property>
>
> If I do the above, I get the Shuffle Error.
> If I use DefaultCodec for mapred.map.output.compression.codec,
> there is no problem.
>
> Is this a known issue? Or is this a bug?
> It doesn't seem like it should be the expected behavior.
>
> I would be glad to contribute any further info on this if necessary.
> Please let me know.
>
> Thanks
>
> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]> wrote:
> > Hi, no problem. Thanks so much for your time. Greatly appreciated.
> >
> > I agree that it must be a configuration problem, so today I started
> > from scratch and did a fresh install of 0.20.2 on the 5-node cluster.
> >
> > I've now noticed that the error occurs when compression is enabled.
> > I've run the basic wordcount example like so:
> > http://pastebin.com/wvDMZZT0
> > and get the Shuffle Error.
> >
> > The TT logs show this error:
> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
> > Invalid header checksum: 225702cc (expected 0x2325)
> > Full logs:
> > http://pastebin.com/fVGjcGsW
> >
> > My mapred-site.xml:
> > http://pastebin.com/mQgMrKQw
> >
> > If I remove the compression config settings, the wordcount works
> > fine - no more Shuffle Error.
> > So I imagine I have something wrong with my compression settings.
> > I'll continue looking into this to see what else I can find out.
> >
> > Thanks a million.
> >
> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
> >> Hi,
> >>
> >> Sorry, I couldn't take a close look at the logs until now.
> >> Unfortunately, I could not see any big difference between the
> >> success and failure cases. Can you please check that things like
> >> basic hostname-to-IP-address mapping are in place (if you have
> >> static resolution of hostnames set up)? A web search suggests this
> >> is the most likely cause users have faced for this problem. Also,
> >> do the disks have enough space? It would also be great if you
> >> could upload your Hadoop configuration.
> >>
> >> I do think it is very likely that configuration is the actual
> >> problem, because it works in one case anyway.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
> >>> Hello,
> >>> I still have had no luck with this over the past week,
> >>> and I even get the exact same problem on a completely different
> >>> 5-node cluster.
> >>> Is it worth opening a new issue in JIRA for this?
> >>> Thanks
> >>>
> >>>
> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
> >>>> Hello,
> >>>> Thanks so much for the reply.
> >>>> See inline.
> >>>>
> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> I've been getting the following error when trying to run a
> >>>>>> very simple MapReduce job.
> >>>>>> Map finishes without problem, but the error occurs as soon as
> >>>>>> it enters the Reduce phase.
> >>>>>>
> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>>>>>
> >>>>>> I am running a 5-node cluster and I believe I have all my
> >>>>>> settings correct:
> >>>>>>
> >>>>>> * ulimit -n 32768
> >>>>>> * DNS/RDNS configured properly
> >>>>>> * hdfs-site.xml: http://pastebin.com/xuZ17bPM
> >>>>>> * mapred-site.xml: http://pastebin.com/JraVQZcW
> >>>>>>
> >>>>>> The program is very simple - it just counts a unique string in
> >>>>>> a log file.
> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>>>>>
> >>>>>> When I run it, the job fails and I get the following output:
> >>>>>> http://pastebin.com/AhW6StEb
> >>>>>>
> >>>>>> However, it runs fine when I do *not* use substring() on the
> >>>>>> value (see the map function in the code above).
> >>>>>>
> >>>>>> This runs fine and completes successfully:
> >>>>>> String str = val.toString();
> >>>>>>
> >>>>>> This causes the error and fails:
> >>>>>> String str = val.toString().substring(0,10);
> >>>>>>
> >>>>>> Please let me know if you need any further information.
> >>>>>> It would be greatly appreciated if anyone could shed some
> >>>>>> light on this problem.
> >>>>>
> >>>>> It catches attention that changing the code to use a substring
> >>>>> is causing a difference. Assuming it is consistent and not a
> >>>>> red herring,
> >>>>
> >>>> Yes, this has been consistent over the last week. I was running
> >>>> 0.20.1 first and then upgraded to 0.20.2, but the results have
> >>>> been exactly the same.
> >>>>
> >>>>> can you look at the counters for the two jobs using the
> >>>>> JobTracker web UI - things like map records, bytes, etc. - and
> >>>>> see if there is a noticeable difference?
> >>>>
> >>>> OK, so here is the first job, using write.set(value.toString());,
> >>>> with *no* errors:
> >>>> http://pastebin.com/xvy0iGwL
> >>>>
> >>>> And here is the second job, using
> >>>> write.set(value.toString().substring(0, 10));, which fails:
> >>>> http://pastebin.com/uGw6yNqv
> >>>>
> >>>> And here is yet another where I used a longer, and therefore
> >>>> unique, string via write.set(value.toString().substring(0, 20));
> >>>> This makes every line unique, similar to the first job.
> >>>> It still fails.
> >>>> http://pastebin.com/GdQ1rp8i
> >>>>
> >>>>> Also, are the two programs being run against
> >>>>> the exact same input data?
> >>>>
> >>>> Yes, exactly the same input: a single CSV file with 23K lines.
> >>>> Using a shorter string leads to more duplicate keys and
> >>>> therefore more combining/reducing, but going by the above it
> >>>> seems to fail whether the substring/key is entirely unique
> >>>> (23000 combine output records) or mostly the same (9 combine
> >>>> output records).
> >>>>
> >>>>> Also, since the cluster size is small, you could also look at
> >>>>> the tasktracker logs on the machines where the maps have run to
> >>>>> see if there are any failures when the reduce attempts start
> >>>>> failing.
> >>>>
> >>>> Here is the TT log from the last failed job. I do not see
> >>>> anything besides the shuffle failure, but there may be something
> >>>> I am overlooking or simply do not understand.
> >>>> http://pastebin.com/DKFTyGXg
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>> Thanks
> >>>>> Hemanth
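If LzopCodec's header handling does turn out to be the culprit, a
commonly suggested workaround (worth verifying against your hadoop-lzo
version) is to use the headerless com.hadoop.compression.lzo.LzoCodec
for intermediate map output, and keep LzopCodec only for final job
output that needs to be readable by the lzop tool. A minimal sketch on
the 0.20.x API - the helper class name here is hypothetical, and the
property names match the mapred-site.xml snippet quoted above:

    import org.apache.hadoop.mapred.JobConf;

    public class LzoJobSetup {
      // Hypothetical helper: compresses intermediate map output with
      // raw LZO (no lzop file header). Equivalent to setting
      //   mapred.compress.map.output=true
      //   mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
      // in mapred-site.xml.
      public static void configure(JobConf conf) {
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(
            com.hadoop.compression.lzo.LzoCodec.class);
      }
    }

The same effect can be had declaratively by pointing
mapred.map.output.compression.codec at LzoCodec instead of LzopCodec in
mapred-site.xml.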
