On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:

> Todd fixed a bug where LZO header or block header data may fall on read
> boundary:
>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>
> I am wondering if that is related to the issue you saw.
>

I don't think this bug would show up in intermediate output compression, but
it's certainly possible. There have been a number of bugs fixed in LZO over
on github - are you using the github version or the one from Google Code
which is out of date? Either mine or Kevin's repo on github should be a good
version (I think we called the newest 0.3.4)

-Todd

>
> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[email protected]> wrote:
>
>> A little more on this.
>>
>> So, I've narrowed down the problem to using Lzop compression
>> (com.hadoop.compression.lzo.LzopCodec)
>> for mapred.map.output.compression.codec.
>>
>> <property>
>>   <name>mapred.map.output.compression.codec</name>
>>   <value>com.hadoop.compression.lzo.LzopCodec</value>
>> </property>
>>
>> If I do the above, I will get the Shuffle Error.
>> If I use DefaultCodec for mapred.map.output.compression.codec,
>> there is no problem.
>>
>> Is this a known issue? Or is this a bug?
>> Doesn't seem like it should be the expected behavior.
>>
>> I would be glad to contribute any further info on this if necessary.
>> Please let me know.
>>
>> Thanks
>>
>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <[email protected]> wrote:
>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >
>> > I agree that it must be a configuration problem and so today I was able
>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node cluster.
>> >
>> > I've now noticed that the error occurs when compression is enabled.
>> > I've run the basic wordcount example as so:
>> > http://pastebin.com/wvDMZZT0
>> > and get the Shuffle Error.
>> >
>> > TT logs show this error:
>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid header checksum: 225702cc (expected 0x2325)
>> > Full logs:
>> > http://pastebin.com/fVGjcGsW
>> >
>> > My mapred-site.xml:
>> > http://pastebin.com/mQgMrKQw
>> >
>> > If I remove the compression config settings, the wordcount works fine
>> > - no more Shuffle Error.
>> > So, I have something wrong with my compression settings I imagine.
>> > I'll continue looking into this to see what else I can find out.
>> >
>> > Thanks a million.
>> >
>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
>> >> Hi,
>> >>
>> >> Sorry, I couldn't take a close look at the logs until now.
>> >> Unfortunately, I could not see any huge difference between the success
>> >> and failure case. Can you please check if things like basic hostname -
>> >> ip address mapping are in place (if you have static resolution of
>> >> hostnames set up) ? A web search is giving this as the most likely
>> >> cause users have faced regarding this problem. Also do the disks have
>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >> configuration information.
>> >>
>> >> I do think it is very likely that configuration is the actual problem
>> >> because it works in one case anyway.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
>> >>> Hello,
>> >>> I still have had no luck with this over the past week.
>> >>> And even get the same exact problem on a completely different 5 node cluster.
>> >>> Is it worth opening a new issue in jira for this?
>> >>> Thanks
>> >>>
>> >>>
>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
>> >>>> Hello,
>> >>>> Thanks so much for the reply.
>> >>>> See inline.
>> >>>>
>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>>> I've been getting the following error when trying to run a very simple
>> >>>>>> MapReduce job.
>> >>>>>> Map finishes without problem, but error occurs as soon as it enters
>> >>>>>> Reduce phase.
>> >>>>>>
>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id : attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>>>>>
>> >>>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>> >>>>>>
>> >>>>>> * ulimit -n 32768
>> >>>>>> * DNS/RDNS configured properly
>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>>>>>
>> >>>>>> The program is very simple - just counts a unique string in a log file.
>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>>>>>
>> >>>>>> When I run, the job fails and I get the following output.
>> >>>>>> http://pastebin.com/AhW6StEb
>> >>>>>>
>> >>>>>> However, runs fine when I do *not* use substring() on the value (see
>> >>>>>> map function in code above).
>> >>>>>>
>> >>>>>> This runs fine and completes successfully:
>> >>>>>> String str = val.toString();
>> >>>>>>
>> >>>>>> This causes error and fails:
>> >>>>>> String str = val.toString().substring(0,10);
>> >>>>>>
>> >>>>>> Please let me know if you need any further information.
>> >>>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>> >>>>>
>> >>>>> It catches attention that changing the code to use a substring is
>> >>>>> causing a difference. Assuming it is consistent and not a red herring,
>> >>>>
>> >>>> Yes, this has been consistent over the last week. I was running 0.20.1
>> >>>> first and then
>> >>>> upgraded to 0.20.2 but results have been exactly the same.
>> >>>>
>> >>>>> can you look at the counters for the two jobs using the JobTracker web
>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>>>> noticeable difference ?
>> >>>>
>> >>>> Ok, so here is the first job using write.set(value.toString()); having
>> >>>> *no* errors:
>> >>>> http://pastebin.com/xvy0iGwL
>> >>>>
>> >>>> And here is the second job using
>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>>> http://pastebin.com/uGw6yNqv
>> >>>>
>> >>>> And here is even another where I used a longer, and therefore unique string,
>> >>>> by write.set(value.toString().substring(0, 20)); This makes every line
>> >>>> unique, similar to first job.
>> >>>> Still fails.
>> >>>> http://pastebin.com/GdQ1rp8i
>> >>>>
>> >>>>> Also, are the two programs being run against
>> >>>>> the exact same input data ?
>> >>>>
>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>>> combining/reducing, but going
>> >>>> by the above it seems to fail whether the substring/key is entirely
>> >>>> unique (23000 combine output records) or
>> >>>> mostly the same (9 combine output records).
>> >>>>
>> >>>>>
>> >>>>> Also, since the cluster size is small, you could also look at the
>> >>>>> tasktracker logs on the machines where the maps have run to see if
>> >>>>> there are any failures when the reduce attempts start failing.
>> >>>>
>> >>>> Here is the TT log from the last failed job. I do not see anything
>> >>>> besides the shuffle failure, but there
>> >>>> may be something I am overlooking or simply do not understand.
>> >>>> http://pastebin.com/DKFTyGXg
>> >>>>
>> >>>> Thanks again!
>> >>>>
>> >>>>>
>> >>>>> Thanks
>> >>>>> Hemanth
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>

--
Todd Lipcon
Software Engineer, Cloudera
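
For readers of the archive: the kind of bug Ted mentions (header data that falls on a read boundary) can be sketched in plain Java. This is not the code from the hadoop-lzo commit. It is a minimal sketch that assumes a reader filling a fixed-size header with a single read() call. ShortReadDemo, ChunkedStream, readOnce and readFully are hypothetical names made up for the illustration, and the 3-byte chunk size and 4-byte header are arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ShortReadDemo {

    /** Hypothetical stream that returns at most `chunk` bytes per bulk read. */
    static final class ChunkedStream extends InputStream {
        private final InputStream in;
        private final int chunk;

        ChunkedStream(byte[] data, int chunk) {
            this.in = new ByteArrayInputStream(data);
            this.chunk = chunk;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Simulate a read boundary: never hand back more than `chunk` bytes.
            return in.read(b, off, Math.min(len, chunk));
        }
    }

    /** Fragile: assumes one read() call fills the whole header. */
    static int readOnce(InputStream in, byte[] header) throws IOException {
        return in.read(header, 0, header.length);
    }

    /** Robust: keeps reading until the header is full or the stream ends. */
    static void readFully(InputStream in, byte[] header) throws IOException {
        int off = 0;
        while (off < header.length) {
            int n = in.read(header, off, header.length - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " bytes");
            }
            off += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        byte[] partial = new byte[4];
        int got = readOnce(new ChunkedStream(data, 3), partial);
        System.out.println("single read returned " + got + " of " + partial.length + " bytes");

        byte[] full = new byte[4];
        readFully(new ChunkedStream(data, 3), full);
        System.out.println("readFully filled header: " + Arrays.toString(full));
    }
}
```

A parser that checksums the buffer after something like readOnce would include the unfilled tail, which is one way a header checksum mismatch could arise. Whether that is what happened in this thread is not established by the messages above.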
