Hi,

Sorry, I couldn't take a close look at the logs until now.
Unfortunately, I could not see any significant difference between the
success and failure cases. Can you please check whether things like
basic hostname-to-IP-address mapping are in place (if you have static
resolution of hostnames set up)? A web search suggests this is the
most common cause users have run into for this problem. Also, do the
disks have enough free space? And it would be great if you could
upload your Hadoop configuration.
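As a quick sanity check, a small program along these lines can confirm
that forward and reverse lookups agree on every node (a rough sketch
only; the hostnames below are placeholders, so substitute the entries
from your slaves file):

import java.net.InetAddress;

// Rough forward/reverse DNS sanity check for each cluster node.
// The hostnames below are placeholders; use the ones from your
// slaves file.
public class DnsCheck {
    public static void main(String[] args) throws Exception {
        String[] hosts = { "node1", "node2", "node3", "node4", "node5" };
        for (String host : hosts) {
            InetAddress addr = InetAddress.getByName(host);  // forward lookup
            String reverse = addr.getCanonicalHostName();    // reverse lookup
            System.out.println(host + " -> " + addr.getHostAddress()
                    + " -> " + reverse);
            if (!reverse.equals(host) && !reverse.startsWith(host + ".")) {
                System.out.println("  WARNING: reverse lookup does not match");
            }
        }
    }
}

It is worth running this from every node, not just the master, since
during the shuffle the reducers fetch map output directly from the
other tasktrackers.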
I do think it is very likely that configuration is the actual problem,
because it does work in one case anyway.

Thanks
Hemanth

On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[email protected]> wrote:
> Hello,
> I still have had no luck with this over the past week,
> and I even get the exact same problem on a completely different
> 5-node cluster.
> Is it worth opening a new issue in JIRA for this?
> Thanks
>
>
> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>>
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>>> Hi,
>>>
>>>> I've been getting the following error when trying to run a very
>>>> simple MapReduce job. The map finishes without problems, but the
>>>> error occurs as soon as it enters the reduce phase.
>>>>
>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>
>>>> I am running a 5-node cluster and I believe I have all my settings
>>>> correct:
>>>>
>>>> * ulimit -n 32768
>>>> * DNS/RDNS configured properly
>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>
>>>> The program is very simple - it just counts a unique string in a
>>>> log file. See here: http://pastebin.com/5uRG3SFL
>>>>
>>>> When I run it, the job fails and I get the following output.
>>>> http://pastebin.com/AhW6StEb
>>>>
>>>> However, it runs fine when I do *not* use substring() on the value
>>>> (see the map function in the code above).
>>>>
>>>> This runs fine and completes successfully:
>>>> String str = val.toString();
>>>>
>>>> This causes the error and fails:
>>>> String str = val.toString().substring(0,10);
>>>>
>>>> Please let me know if you need any further information.
>>>> It would be greatly appreciated if anyone could shed some light on
>>>> this problem.
>>>
>>> It catches attention that changing the code to use a substring is
>>> causing a difference. Assuming it is consistent and not a red herring,
>>
>> Yes, this has been consistent over the last week. I was running
>> 0.20.1 first and then upgraded to 0.20.2, but the results have been
>> exactly the same.
>>
>>> can you look at the counters for the two jobs using the JobTracker
>>> web UI - things like map records, bytes etc. - and see if there is
>>> a noticeable difference?
>>
>> OK, so here is the first job, using write.set(value.toString());,
>> which has *no* errors:
>> http://pastebin.com/xvy0iGwL
>>
>> And here is the second job, using
>> write.set(value.toString().substring(0, 10));, which fails:
>> http://pastebin.com/uGw6yNqv
>>
>> And here is yet another, where I used a longer and therefore unique
>> string, write.set(value.toString().substring(0, 20));. This makes
>> every line unique, similar to the first job.
>> It still fails.
>> http://pastebin.com/GdQ1rp8i
>>
>>> Also, are the two programs being run against the exact same input
>>> data?
>>
>> Yes, exactly the same input: a single CSV file with 23K lines.
>> Using a shorter substring leads to more duplicate keys and therefore
>> more combining/reducing, but going by the above it seems to fail
>> whether the substring/key is entirely unique (23000 combine output
>> records) or mostly the same (9 combine output records).
>>
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
>>> there are any failures when the reduce attempts start failing.
>>
>> Here is the TT log from the last failed job. I do not see anything
>> besides the shuffle failure, but there may be something I am
>> overlooking or simply do not understand.
>> http://pastebin.com/DKFTyGXg
>>
>> Thanks again!
>>
>>> Thanks
>>> Hemanth
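P.S. One more thing that is cheap to rule out on the code side, even
though your maps report success: substring(0, 10) throws
StringIndexOutOfBoundsException for any line shorter than 10
characters. A guarded version would look something like this (a sketch
only; I am guessing the surrounding map code from your paste, so the
variable names are assumptions):

// Guard the substring so that short lines cannot throw
// StringIndexOutOfBoundsException. The variable names are guessed
// from the snippets quoted above, not copied from the real code.
String str = value.toString();
if (str.length() > 10) {
    str = str.substring(0, 10);
}
write.set(str);

If the job still fails with the guard in place, that would strengthen
the case that the problem is in the cluster configuration rather than
in the substring call itself.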
