Hi,

No problem, and thanks so much for your time; it's greatly appreciated. I
agree that it must be a configuration problem, so today I started from
scratch and did a fresh install of 0.20.2 on the 5-node cluster.
I've now noticed that the error occurs when compression is enabled. I ran
the basic wordcount example like so: http://pastebin.com/wvDMZZT0 and got
the Shuffle Error. The TT logs show this error:

WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
header checksum: 225702cc (expected 0x2325)

Full logs: http://pastebin.com/fVGjcGsW
My mapred-site.xml: http://pastebin.com/mQgMrKQw

If I remove the compression config settings, the wordcount works fine - no
more Shuffle Error. So I imagine something is wrong with my compression
settings. I'll keep looking into this to see what else I can find out.
Thanks a million.

On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <[email protected]> wrote:
> Hi,
>
> Sorry, I couldn't take a close look at the logs until now.
> Unfortunately, I could not see any big difference between the success
> and failure cases. Can you please check that basics like hostname-to-IP
> mapping are in place (if you have static resolution of hostnames set
> up)? A web search suggests this is the most common cause of this
> problem. Also, do the disks have enough space? It would also be great
> if you could upload your Hadoop configuration.
>
> I do think it is very likely that configuration is the actual problem,
> because it works in one case anyway.
>
> Thanks
> Hemanth
>
> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
>> Hello,
>> I still have had no luck with this over the past week, and I even get
>> exactly the same problem on a completely different 5-node cluster.
>> Is it worth opening a new issue in JIRA for this?
>> Thanks
>>
>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
>>> Hello,
>>> Thanks so much for the reply. See inline.
>>>
>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>>>> Hi,
>>>>
>>>>> I've been getting the following error when trying to run a very
>>>>> simple MapReduce job. Map finishes without problem, but the error
>>>>> occurs as soon as it enters the Reduce phase.
>>>>>
>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>
>>>>> I am running a 5-node cluster and I believe I have all my settings
>>>>> correct:
>>>>>
>>>>> * ulimit -n 32768
>>>>> * DNS/RDNS configured properly
>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>
>>>>> The program is very simple - it just counts a unique string in a
>>>>> log file. See here: http://pastebin.com/5uRG3SFL
>>>>>
>>>>> When I run it, the job fails and I get the following output:
>>>>> http://pastebin.com/AhW6StEb
>>>>>
>>>>> However, it runs fine when I do *not* use substring() on the value
>>>>> (see the map function in the code above).
>>>>>
>>>>> This runs fine and completes successfully:
>>>>> String str = val.toString();
>>>>>
>>>>> This causes the error and fails:
>>>>> String str = val.toString().substring(0,10);
>>>>>
>>>>> Please let me know if you need any further information. It would be
>>>>> greatly appreciated if anyone could shed some light on this problem.
>>>>
>>>> It catches the eye that changing the code to use a substring makes a
>>>> difference. Assuming it is consistent and not a red herring,
>>>
>>> Yes, this has been consistent over the last week.
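>>> To be concrete, the two runs differ by a single line in the map
>>> function. Here is a minimal sketch of the mapper (the class skeleton
>>> and names are reconstructed for illustration; the actual code is in
>>> the pastebin above):
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.IntWritable;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>>
>>> public class CountMapper
>>>     extends Mapper<LongWritable, Text, Text, IntWritable> {
>>>
>>>   private final static IntWritable one = new IntWritable(1);
>>>   private final Text word = new Text();
>>>
>>>   @Override
>>>   public void map(LongWritable key, Text val, Context context)
>>>       throws IOException, InterruptedException {
>>>     // This variant completes successfully:
>>>     // String str = val.toString();
>>>     // This variant triggers the Shuffle Error:
>>>     String str = val.toString().substring(0, 10);
>>>     word.set(str);
>>>     context.write(word, one);
>>>   }
>>> }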
>>> I was running 0.20.1 first and then upgraded to 0.20.2, but the
>>> results have been exactly the same.
>>>
>>>> Can you look at the counters for the two jobs using the JobTracker
>>>> web UI - things like map records, bytes etc. - and see if there is
>>>> a noticeable difference?
>>>
>>> Ok, so here is the first job, using write.set(value.toString());,
>>> which has *no* errors:
>>> http://pastebin.com/xvy0iGwL
>>>
>>> And here is the second job, using
>>> write.set(value.toString().substring(0, 10));, which fails:
>>> http://pastebin.com/uGw6yNqv
>>>
>>> And here is yet another, where I used a longer, and therefore
>>> unique, string via write.set(value.toString().substring(0, 20));
>>> This makes every line unique, similar to the first job. It still
>>> fails:
>>> http://pastebin.com/GdQ1rp8i
>>>
>>>> Also, are the two programs being run against the exact same input
>>>> data?
>>>
>>> Yes, exactly the same input: a single CSV file with 23K lines. Using
>>> a shorter substring leads to more duplicate keys and therefore more
>>> combining/reducing, but going by the above it seems to fail whether
>>> the substring/key is entirely unique (23,000 combine output records)
>>> or mostly the same (9 combine output records).
>>>
>>>> Also, since the cluster size is small, you could look at the
>>>> tasktracker logs on the machines where the maps have run to see if
>>>> there are any failures when the reduce attempts start failing.
>>>
>>> Here is the TT log from the last failed job. I do not see anything
>>> besides the shuffle failure, but there may be something I am
>>> overlooking or simply do not understand:
>>> http://pastebin.com/DKFTyGXg
>>>
>>> Thanks again!
>>>
>>>> Thanks
>>>> Hemanth
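P.S. For reference, the compression settings I removed from
mapred-site.xml to get wordcount passing are of this general shape (a
sketch only - the codec value below is illustrative and not necessarily
the one in my pastebin'd config):

  <!-- Compress intermediate map output before the shuffle. Removing
       these two properties made the Shuffle Error go away. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>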
