Re: skew join in pig

Alan Gates Wed, 16 Jun 2010 09:17:40 -0700


On Jun 16, 2010, at 8:36 AM, Gang Luo wrote:

Hi,
there is something confusing me about the skew join (http://wiki.apache.org/pig/PigSkewedJoinSpec).

1. Does the sampling job sample and build a histogram on both tables, or just one table (and in that case, which one)?

Just the left one.
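
As a rough illustration of what "sample the left input and build a histogram" can mean, here is a minimal Python sketch. It is not Pig's code: the function name `estimate_hot_keys`, the `sample_rate` parameter, and the `max_rows_per_reducer` threshold are all assumptions made up for this example (the answer above describes the real criterion as whether a key's values fit in memory).

```python
import math
import random
from collections import Counter

def estimate_hot_keys(left_rows, key_fn, sample_rate, max_rows_per_reducer, seed=0):
    """Sample ONLY the left input and estimate how many reducers each key needs.

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    # Bernoulli sample of the left table; the right table is never read here.
    sample = [row for row in left_rows if rng.random() < sample_rate]
    # Histogram of join keys seen in the sample.
    histogram = Counter(key_fn(row) for row in sample)

    reducers_needed = {}
    for key, sampled_count in histogram.items():
        # Scale the sampled count back up to an estimate for the full table.
        estimated_rows = sampled_count / sample_rate
        n = math.ceil(estimated_rows / max_rows_per_reducer)
        if n > 1:  # only keys too large for one reducer are recorded
            reducers_needed[key] = n
    return reducers_needed
```

Keys that do not appear in the returned mapping are treated as ordinary keys and need no special handling in the join job.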

2. The join job still takes the two tables as inputs, shuffles tuples from the partitioned table to a particular reducer (one tuple to one reducer), and shuffles tuples from the streamed table to all reducers associated with one partition (one tuple to multiple reducers). Is that correct?

Keys with small enough values to fit in memory are shuffled to reducers as normal. Keys that are too large are split between reducers on the left side, and replicated on the right side to all of the reducers that hold those splits (not to all reducers). Does that answer your question?
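
The routing rule just described can be sketched in Python as follows. This is an illustration under stated assumptions, not Pig's partitioner: the class `SkewRouter`, the round-robin split of left tuples, and the choice of a contiguous reducer range starting at the key's hash are all hypothetical. `reducers_needed` stands for a key-to-reducer-count mapping derived from the sampling job.

```python
import zlib

def default_reducer(key, num_reducers):
    # Stable stand-in for a hash partitioner (Python's built-in hash()
    # is randomized per process for strings, so crc32 is used instead).
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers

class SkewRouter:
    """Hypothetical router: split hot keys on the left, replicate on the right."""

    def __init__(self, reducers_needed, num_reducers):
        self.num_reducers = num_reducers
        self.targets = {}
        for key, n in reducers_needed.items():
            start = default_reducer(key, num_reducers)
            n = min(n, num_reducers)
            # Assumption: a hot key uses n consecutive reducers from its hash slot.
            self.targets[key] = [(start + i) % num_reducers for i in range(n)]
        self._next = {key: 0 for key in self.targets}

    def route_left(self, key):
        """Left tuple goes to exactly one reducer; hot keys rotate over their set."""
        targets = self.targets.get(key)
        if targets is None:
            return [default_reducer(key, self.num_reducers)]
        i = self._next[key]
        self._next[key] = (i + 1) % len(targets)
        return [targets[i]]

    def route_right(self, key):
        """Right tuple is copied to every reducer holding a split of that key."""
        targets = self.targets.get(key)
        if targets is None:
            return [default_reducer(key, self.num_reducers)]
        return list(targets)
```

In this sketch a right-side tuple for a hot key is copied only to the reducers in that key's target list, never to the whole cluster, and non-hot keys take the same single-reducer path on both sides.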

3. Hot keys need more than one reducer. Are these reducers dedicated to this key only? Could they also take other keys at the same time?

They take other keys at the same time.

4. For non-hot keys, my understanding is that they are shuffled to reducers based on the default hash partitioner. However, it could happen that all the keys shuffled to one reducer cause skew even though none of them is skewed individually.

This is always a possibility in MapReduce, though a good hash function should minimize how often it occurs.
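
To see how this kind of collision skew arises, one can tally per-reducer load under a plain hash partitioner. The sketch below is illustrative only: `reducer_loads` is a hypothetical helper, and `zlib.crc32` stands in for whatever hash function the real partitioner uses.

```python
import zlib
from collections import Counter

def reducer_loads(rows_per_key, num_reducers):
    """Total rows landing on each reducer when keys are assigned by hash alone."""
    # Start every reducer at zero so idle reducers show up in the result.
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key, rows in rows_per_key.items():
        r = zlib.crc32(str(key).encode("utf-8")) % num_reducers
        loads[r] += rows
    return loads

if __name__ == "__main__":
    # 1000 keys of equal, modest size: no single key is skewed.
    loads = reducer_loads({f"key{i}": 100 for i in range(1000)}, num_reducers=20)
    print("max load:", max(loads.values()), "min load:", min(loads.values()))
```

Any gap between the maximum and minimum load comes purely from how many keys happen to hash to the same reducer, which the skewed join does not attempt to correct.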


Can someone give me some ideas on these? Thanks.

-Gang

Alan.
