Hello, I still have had no luck with this over the past week. I even get the exact same problem on a completely different 5 node cluster. Is it worth opening a new issue in JIRA for this? Thanks
On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[email protected]> wrote:
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[email protected]> wrote:
>> Hi,
>>
>>> I've been getting the following error when trying to run a very simple
>>> MapReduce job.
>>> Map finishes without problem, but error occurs as soon as it enters
>>> Reduce phase.
>>>
>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>
>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>
>>> * ulimit -n 32768
>>> * DNS/RDNS configured properly
>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>
>>> The program is very simple - just counts a unique string in a log file.
>>> See here: http://pastebin.com/5uRG3SFL
>>>
>>> When I run, the job fails and I get the following output.
>>> http://pastebin.com/AhW6StEb
>>>
>>> However, runs fine when I do *not* use substring() on the value (see
>>> map function in code above).
>>>
>>> This runs fine and completes successfully:
>>> String str = val.toString();
>>>
>>> This causes error and fails:
>>> String str = val.toString().substring(0,10);
>>>
>>> Please let me know if you need any further information.
>>> It would be greatly appreciated if anyone could shed some light on this
>>> problem.
>>
>> It catches attention that changing the code to use a substring is
>> causing a difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1
> first and then upgrade to 0.20.2 but results have been exactly the same.
>
>> can you look at the counters for the two jobs using the JobTracker web
>> UI - things like map records, bytes etc and see if there is a
>> noticeable difference ?
>
> Ok, so here is the first job using write.set(value.toString()); having
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job using
> write.set(value.toString().substring(0, 10)); that fails:
> http://pastebin.com/uGw6yNqv
>
> And here is even another where I used a longer, and therefore unique string,
> by write.set(value.toString().substring(0, 20)); This makes every line
> unique, similar to first job.
> Still fails.
> http://pastebin.com/GdQ1rp8i
>
>> Also, are the two programs being run against
>> the exact same input data ?
>
> Yes, exactly the same input: a single csv file with 23K lines.
> Using a shorter string leads to more like keys and therefore more
> combining/reducing, but going by the above it seems to fail whether
> the substring/key is entirely unique (23000 combine output records) or
> mostly the same (9 combine output records).
>
>>
>> Also, since the cluster size is small, you could also look at the
>> tasktracker logs on the machines where the maps have run to see if
>> there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there may be something I am
> overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
>>
>> Thanks
>> Hemanth
>>
>
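
For reference, here is a minimal local sketch (plain Java, no Hadoop) of how the map key is derived in the substring variants discussed in the quoted thread, and how the prefix length changes the number of distinct keys. The class name, method names, and sample lines are hypothetical and made up for illustration; they are not taken from the pastebin code. The length guard is an assumption added here, because String.substring(0, 10) throws StringIndexOutOfBoundsException on a line shorter than 10 characters. It is not claimed to be the cause of the shuffle error.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrefixKeys {
    // Hypothetical helper: returns the first `len` characters of a line,
    // or the whole line when it is shorter than `len` (avoids the
    // StringIndexOutOfBoundsException that a bare substring(0, len) throws).
    public static String keyFor(String line, int len) {
        return line.length() <= len ? line : line.substring(0, len);
    }

    // Counts how many distinct keys a given prefix length produces,
    // i.e. how many groups the combine/reduce step would see.
    public static int distinctKeys(List<String> lines, int len) {
        Set<String> keys = new HashSet<>();
        for (String line : lines) {
            keys.add(keyFor(line, len));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the csv input.
        List<String> lines = Arrays.asList(
            "2010-06-24 18:41:00,alpha",
            "2010-06-24 18:41:05,beta",
            "2010-06-25 09:00:00,gamma",
            "short");
        System.out.println("prefix 10: " + distinctKeys(lines, 10) + " distinct keys");
        System.out.println("prefix 20: " + distinctKeys(lines, 20) + " distinct keys");
    }
}
```

Running something like this over the real input file on a single machine would show whether any line is shorter than the prefix length and how many distinct keys each variant produces, independent of the cluster.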
