On Wed, Jun 16, 2010 at 9:16 AM, Alan Gates <ga...@yahoo-inc.com> wrote:
> > > 4. for non-hot keys, my understanding is that they are shuffled to reducers >> based on default hash partitioner. However, it could happen all the keys >> shuffled to one reducers incurs skew even none of them is skewed >> individually. >> > This is always the case in map reduce, though a good hash function should > minimize the occurrences of this. Plus this isn't the sort of skew that kills jobs. What we are worried about with skew join is the amount of memory needed to buffer all records for a single key -- if that amount is greater than available memory, we start swapping and things get very bad. If all that's happening is that a partition has a few more non-skewed keys than other partitions, it will still be able to make progress perfectly fine, it'll just run a bit longer; not a big deal in the grand scheme of things.