Re: Distributing Keys across Reducers

John Armstrong Fri, 20 Jul 2012 07:21:17 -0700

On 07/20/2012 09:20 AM, Dave Shine wrote:

I believe this is referred to as a “key skew problem”, which I know is
heavily dependent on the actual data being processed.  Can anyone point
me to any blog posts, white papers, etc. that might give me some options
on how to deal with this issue?

I don't know about blog posts or white papers, but the canonical answer here is usually using a different Partitioner.

The default one takes the .hashCode() of each Mapper output key and reduces it modulo the number of Reducers you've specified (43, here). So the first place I'd look is to see if there's some reason you're getting so many more outputs with one key-hash-mod-43 than the others.
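To check for that, one option is to replay the same arithmetic over a sample of your mapper output keys and count how many land on each reducer. What follows is only a standalone sketch, not Hadoop code: the class name PartitionHistogram, its methods, and the sample keys are made up for illustration, and it assumes the default partitioner clears the sign bit of hashCode() before taking the remainder.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: counts how many sample keys
// would be sent to each reducer under hash-mod-N partitioning.
class PartitionHistogram {

    // Clear the sign bit so the result is never negative, then take the remainder.
    static int defaultPartition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // counts[i] is the number of sample keys that map to reducer i.
    static int[] histogram(Iterable<?> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (Object key : keys) {
            counts[defaultPartition(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample; in practice, feed in keys taken from real mapper output.
        List<String> sample = Arrays.asList("alpha", "beta", "gamma", "alpha", "alpha");
        int[] counts = histogram(sample, 43);
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println("reducer " + i + ": " + counts[i]);
            }
        }
    }
}
```

If one bucket dwarfs the rest, look at which distinct keys fall into it: a single repeated key points to the "one huge key" case, while many distinct keys sharing a bucket points to the hash function.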

A common answer here is that one key alone has a huge number of outputs, in which case it's hard to do anything better with it. Another case is that your key class's hash function is bad at telling apart a certain class of keys that occur with some regularity. Since 43 is an odd prime, I would not expect a moderately evenly distributed hash to suddenly get spikes at certain values mod-43.

So if you want to (and can) rejigger your hashes to spread things more evenly, great. If not, you're down to writing your own partitioner. It's slightly different depending on which API you're using, but either way you basically have to write a function called getPartition that takes a mapper output record (key and value) and the number of reducers and returns the index (from 0 to numReducers-1) of the reducer that should handle that record. And unless you REALLY know what you're doing, the function should probably only depend on the key.
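As a rough illustration of the shape of such a function, here is a standalone sketch that does not depend on the Hadoop jars. The class name MixingPartitioner and the choice of scrambling the hash with the MurmurHash3 32-bit finalizer are my own assumptions, not anything from your job. In a real job the same method body would go into a class that extends org.apache.hadoop.mapreduce.Partitioner (new API) or implements org.apache.hadoop.mapred.Partitioner (old API).

```java
// Hypothetical standalone sketch of a key-only partitioner.
class MixingPartitioner {

    // MurmurHash3 32-bit finalizer: spreads out hash codes that differ
    // only in a few bits. Swapped in here purely as an example of
    // "rejiggering" a weak hash.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Same parameter order as getPartition in Hadoop: key, value, number of reducers.
    // The value is deliberately ignored so that every record with the same key
    // goes to the same reducer.
    static <K, V> int getPartition(K key, V value, int numReducers) {
        return (mix(key.hashCode()) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Made-up keys and values, just to show the call.
        String[] keys = {"user-1", "user-2", "user-3"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + getPartition(key, "ignored", 43));
        }
    }
}
```

Note that no amount of mixing helps when a single key carries most of the records, since all of them still have to reach one reducer.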


Good luck.
