Ashish ably outlined the differences between a join and a merge, but might be confusing the o.a.h.mapred.join package and the contrib/data_join framework. The former is used for map-side joins and has nothing to do with either the shuffle or the reduce; the latter effects joins in the reduce.

The critical difference between the merge phase in map/reduce and a join is this: merge outputs are grouped by a comparator and consumed in sorted order, whereas a join involves n datasets, and its consumers consider the cartesian product of the records for selected keys (in both frameworks, equal keys). The practical differences between the two aforementioned join frameworks involve tradeoffs in efficiency and constraints on input data.

-C
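
To make the distinction concrete, here is a minimal sketch in plain Python. It is not Hadoop code and uses neither o.a.h.mapred.join nor contrib/data_join; the function names merge_groups and inner_join and the sample records are hypothetical, chosen only for illustration. The first function mimics the merge (one sorted stream, values grouped per key), the second an equi-join over n datasets (the cartesian product of the values for each key present in every dataset).

```python
from itertools import groupby, product
from operator import itemgetter


def merge_groups(records):
    """Merge-style consumption: sort by key, then group the values per key."""
    ordered = sorted(records, key=itemgetter(0))  # stable sort keeps value order
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def inner_join(*datasets):
    """Equi-join over n datasets: for each key present in every dataset,
    emit the cartesian product of that key's values, one value per dataset."""
    grouped = [dict(merge_groups(d)) for d in datasets]
    common = set(grouped[0]).intersection(*grouped[1:])
    joined = []
    for key in sorted(common):
        for combo in product(*(g[key] for g in grouped)):
            joined.append((key, combo))
    return joined


# Hypothetical sample inputs: two datasets of (key, value) records.
a = [("k1", "a1"), ("k2", "a2"), ("k1", "a3")]
b = [("k1", "b1"), ("k1", "b2"), ("k3", "b3")]

merged = merge_groups(a + b)  # one stream; the source dataset is not tracked
joined = inner_join(a, b)     # n streams; only keys common to all survive
```

With these inputs, the merge hands the consumer all four k1 values as a single group and still emits k2 and k3, while the join emits four (a, b) pairs for k1 and nothing for k2 or k3.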

On Jul 3, 2008, at 7:54 AM, Stuart Sierra wrote:

Hello all,

After recent talk about joins, I have a (possibly) stupid question:

What is the difference between the "join" operations in
o.a.h.mapred.join and the standard merge step in a MapReduce job?

I understand that doing a join in the Mapper would be much more
efficient if you're lucky enough to have your input pre-sorted and
-partitioned.

But how is a join operation in the Reducer any different from the
shuffle/sort/merge that the MapReduce framework already does?

Be gentle.  Thanks,
-Stuart
