These numbers are with everything I laid out below.  The job was running 
acceptably until a couple of days ago, when a change increased the output of 
the Map phase by about 30%.  I don't think there is anything special about 
those additional keys that would force them all into the same reducer.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Friday, July 20, 2012 11:56 AM
To: mapreduce-user@hadoop.apache.org
Cc: john.armstr...@ccri.com
Subject: Re: Distributing Keys across Reducers

Does applying a combiner make any difference? Or are these numbers with the 
combiner included?

On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine 
<dave.sh...@channelintelligence.com> wrote:
> Thanks John.
>
> The key is my own WritableComparable object, and I have created a custom 
> Comparator, Partitioner, and KeyValueGroupingComparator.  However, they are 
> all pretty generic.  The Key class has two properties, a boolean and a 
> string.  I'm grouping on just the string, but comparing on both properties to 
> ensure that the reducer receives the "true" values before the "false" values.
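>
> Roughly, the key class looks like this (simplified, with placeholder names 
> rather than my actual code):
>
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import org.apache.hadoop.io.WritableComparable;
>
> // Composite key: a string we group on plus a boolean flag.
> public class MyKey implements WritableComparable<MyKey> {
>     private boolean flag;
>     private String text;
>
>     public MyKey() {}                     // no-arg constructor required by Hadoop
>
>     public String getText() { return text; }
>
>     @Override
>     public void write(DataOutput out) throws IOException {
>         out.writeBoolean(flag);
>         out.writeUTF(text);
>     }
>
>     @Override
>     public void readFields(DataInput in) throws IOException {
>         flag = in.readBoolean();
>         text = in.readUTF();
>     }
>
>     @Override
>     public int compareTo(MyKey other) {
>         int c = text.compareTo(other.text);   // primary order: the string
>         if (c != 0) return c;
>         if (flag == other.flag) return 0;
>         return flag ? -1 : 1;                 // "true" sorts ahead of "false"
>     }
>
>     @Override
>     public int hashCode() { return text.hashCode(); }
>
>     @Override
>     public boolean equals(Object o) {
>         return o instanceof MyKey
>             && ((MyKey) o).flag == flag
>             && ((MyKey) o).text.equals(text);
>     }
> }
>
> The grouping comparator compares only the string portion, so the "true" and 
> "false" records for the same string all arrive in a single reduce() call.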
>
> My partitioner does the basic hash of just the string portion of the key 
> class.  I'm hoping to find some guidance on how to make that partitioner 
> smarter and avoid this problem.
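>
> The partitioner itself is roughly this (again simplified; the Text value type 
> is just an example):
>
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Partitioner;
>
> // Hash only the string portion of the composite key, so that the "true" and
> // "false" records for the same string land on the same reducer.
> public class StringHashPartitioner extends Partitioner<MyKey, Text> {
>     @Override
>     public int getPartition(MyKey key, Text value, int numPartitions) {
>         return (key.getText().hashCode() & Integer.MAX_VALUE) % numPartitions;
>     }
> }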
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com
>
>
> -----Original Message-----
> From: John Armstrong [mailto:j...@ccri.com]
> Sent: Friday, July 20, 2012 10:20 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I believe this is referred to as a "key skew problem", which I know 
>> is heavily dependent on the actual data being processed.  Can anyone 
>> point me to any blog posts, white papers, etc. that might give me 
>> some options on how to deal with this issue?
>
> I don't know about blog posts or white papers, but the canonical answer here 
> is usually to use a different Partitioner.
>
> The default one takes the .hashCode() of each Mapper output key and reduces it 
> modulo the number of Reducers you've specified (43, here).  So the first 
> place I'd look is to see if there's some reason you're getting so many more 
> outputs with one key-hash-mod-43 than the others.
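>
> For reference, the stock HashPartitioner (new API) boils down to just this:
>
> import org.apache.hadoop.mapreduce.Partitioner;
>
> // Clear the sign bit, then take the key's hashCode() modulo the number of
> // reducers.
> public class HashPartitioner<K, V> extends Partitioner<K, V> {
>     @Override
>     public int getPartition(K key, V value, int numReduceTasks) {
>         return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>     }
> }
>
> So the spread you get is entirely a function of your key's hashCode().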
>
> A common answer here is that one key alone has a huge number of outputs, in 
> which case it's hard to do anything better with it.  Another case is that 
> your key class' hash function is bad at telling apart a certain class of keys 
> that occur with some regularity.  Since 43 is an odd prime, I would not 
> expect a moderately evenly distributed hash to suddenly get spikes at certain 
> values mod-43.
>
> So if you want to (and can) rejigger your hashes to spread things more 
> evenly, great.  If not, you're down to writing your own partitioner.
> It's slightly different depending on which API you're using, but either way 
> you basically have to write a function called getPartition that takes a 
> mapper output record (key and value) and the number of reducers and returns 
> the index (from 0 to numReducers-1) of the reducer that should handle that 
> record.  And unless you REALLY know what you're doing, the function should 
> probably only depend on the key.
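>
> As a very rough skeleton (new API; the "hot" key and the Text types here are 
> made up for illustration, you'd plug in whatever your own inspection of the 
> data turns up):
>
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Partitioner;
>
> // Skeleton only: park one known-heavy key on a dedicated reducer and spread
> // everything else over the remaining reducers by hash.
> public class SkewAwarePartitioner extends Partitioner<Text, Text> {
>     private static final String HOT_KEY = "the-one-huge-key";  // hypothetical
>
>     @Override
>     public int getPartition(Text key, Text value, int numPartitions) {
>         if (numPartitions <= 1) {
>             return 0;
>         }
>         if (key.toString().equals(HOT_KEY)) {
>             return numPartitions - 1;           // reserved for the hot key
>         }
>         return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
>     }
> }
>
> You would then register it with job.setPartitionerClass(SkewAwarePartitioner.class).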
>
> Good luck.
>



--
Harsh J
