These numbers are with everything I laid out below. The job was running acceptably until a couple of days ago when a change increased the output of the Map phase by about 30%. I don't think there is anything special about those additional keys that would force them all into the same reducer.
Dave Shine
Sr. Software Engineer
321.939.5093 direct | 407.314.0122 mobile
CI Boost™ Clients Outperform Online™
www.ciboost.com

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Friday, July 20, 2012 11:56 AM
To: mapreduce-user@hadoop.apache.org
Cc: john.armstr...@ccri.com
Subject: Re: Distributing Keys across Reducers

Does applying a combiner make any difference? Or are these numbers with the combiner included?

On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine <dave.sh...@channelintelligence.com> wrote:
> Thanks John.
>
> The key is my own WritableComparable object, and I have created custom
> Comparator, Partitioner, and KeyValueGroupingComparator. However, they are
> all pretty generic. The Key class has two properties, a boolean and a
> string. I'm grouping on just the string, but comparing on both properties to
> ensure that the reducer receives the "true" values before the "false" values.
>
> My partitioner does a basic hash of just the string portion of the key
> class. I'm hoping to find some guidance on how to make that partitioner
> smarter and avoid this problem.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile
> CI Boost(tm) Clients Outperform Online(tm)
> www.ciboost.com
>
> -----Original Message-----
> From: John Armstrong [mailto:j...@ccri.com]
> Sent: Friday, July 20, 2012 10:20 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I believe this is referred to as a "key skew problem", which I know
>> is heavily dependent on the actual data being processed. Can anyone
>> point me to any blog posts, white papers, etc. that might give me
>> some options on how to deal with this issue?
>
> I don't know about blog posts or white papers, but the canonical answer here
> is usually using a different Partitioner.
>
> The default one takes the .hashCode() of each Mapper output key and reduces it
> modulo the number of Reducers you've specified (43, here). So the first
> place I'd look is to see if there's some reason you're getting so many more
> outputs with one key-hash-mod-43 than the others.
>
> A common answer here is that one key alone has a huge number of outputs, in
> which case it's hard to do anything better with it. Another case is that
> your key class's hash function is bad at telling apart a certain class of
> keys that occur with some regularity. Since 43 is an odd prime, I would not
> expect a moderately evenly distributed hash to suddenly get spikes at
> certain values mod 43.
>
> So if you want to (and can) rejigger your hashes to spread things more
> evenly, great. If not, you're down to writing your own Partitioner.
> It's slightly different depending on which API you're using, but either way
> you basically have to write a function called getPartition that takes a
> mapper output record (key and value) and the number of reducers, and returns
> the index (from 0 to numReducers-1) of the reducer that should handle that
> record. And unless you REALLY know what you're doing, the function should
> probably depend only on the key.
>
> Good luck.
>
> The information contained in this email message is considered confidential
> and proprietary to the sender and is intended solely for review and use by
> the named recipient. Any unauthorized review, use or distribution is strictly
> prohibited. If you have received this message in error, please advise the
> sender by reply email and delete the message.

--
Harsh J
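[Editor's note: the getPartition arithmetic John describes can be sketched in plain Java. The class and method names below are illustrative, not from the thread; a real implementation would extend Hadoop's Partitioner class, while this standalone version only shows the hash-mod-numReducers logic that the default HashPartitioner applies.]

```java
// Sketch of the partitioning logic described above. A real Hadoop
// partitioner extends org.apache.hadoop.mapreduce.Partitioner<K, V>
// and overrides getPartition; this standalone class just demonstrates
// the arithmetic on the string portion of the key.
public class PartitionSketch {

    // Hash the grouping string, mask off the sign bit so the result is
    // non-negative, then reduce modulo the number of reducers (43 in
    // the job discussed above).
    static int getPartition(String groupKey, int numReducers) {
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 43;
        int p = getPartition("some-grouping-string", numReducers);
        // The partition index always falls in [0, numReducers - 1].
        System.out.println(p >= 0 && p < numReducers); // prints "true"
    }
}
```

A "smarter" partitioner for skewed data typically replaces this uniform hash with something data-aware, for example salting the few hottest keys across several reducers (and re-merging afterwards), or sampling the key distribution up front the way Hadoop's TotalOrderPartitioner does with range boundaries.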