Re: Difference between joining and reducing

Chris Douglas Thu, 03 Jul 2008 11:59:17 -0700

Ashish ably outlined the differences between a join and a merge, but might be confusing the o.a.h.mapred.join package and the contrib/data_join framework. The former is used for map-side joins and has nothing to do with either the shuffle or the reduce; the latter effects joins in the reduce.
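
To illustrate why a map-side join needs neither the shuffle nor the reduce, here is a minimal sketch in plain Python. It is not the o.a.h.mapred.join API; the function name sorted_merge_join and the sample records are hypothetical. It assumes both inputs are already sorted by key (and, in a real job, identically partitioned), so a single forward pass over each input is enough to pair up equal keys.

```python
from itertools import groupby, product
from operator import itemgetter


def sorted_merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs that are ALREADY sorted by key.

    Hypothetical helper for illustration only; no shuffle or re-sort happens here.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    done = (None, None)
    lk, lg = next(left_groups, done)
    rk, rg = next(right_groups, done)
    while lg is not None and rg is not None:
        if lk < rk:
            # Key only on the left: skip it and advance the left cursor.
            lk, lg = next(left_groups, done)
        elif lk > rk:
            # Key only on the right: skip it and advance the right cursor.
            rk, rg = next(right_groups, done)
        else:
            # Equal keys: emit every pairing of left and right values.
            lvals = [v for _, v in lg]
            rvals = [v for _, v in rg]
            for pair in product(lvals, rvals):
                yield lk, pair
            lk, lg = next(left_groups, done)
            rk, rg = next(right_groups, done)


left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("c", "y"), ("d", "z"), ("d", "w")]
joined = list(sorted_merge_join(left, right))
```

A reduce-side join would instead tag each record with its source, let the shuffle bring equal keys together, and do the pairing in the reducer. That tolerates unsorted input at the cost of moving every record through the shuffle.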

The critical difference between the merge phase in map/reduce and a join is that merge outputs are grouped by a comparator and consumed in sorted order while, in contrast, joins involve n datasets and consumers will consider the cartesian product of selected keys (in both frameworks, equal keys). The practical differences between the two aforementioned join frameworks involve tradeoffs in efficiency and constraints on input data. -C
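
The distinction above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Hadoop code: merge_phase and join are hypothetical names, "grouped by a comparator" is simplified to grouping on equal keys, and the datasets are small in-memory lists.

```python
import heapq
from itertools import groupby, product
from operator import itemgetter


def merge_phase(*sorted_runs):
    """Merge sorted runs of (key, value) pairs; yield each key once, in sorted
    order, with all of its values grouped together (what a reducer consumes)."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


def join(*datasets):
    """For each key present in every dataset, yield the cartesian product of
    that key's values, one value per dataset in each output tuple."""
    tables = []
    for dataset in datasets:
        table = {}
        for key, value in dataset:
            table.setdefault(key, []).append(value)
        tables.append(table)
    common_keys = set(tables[0]).intersection(*tables[1:])
    for key in sorted(common_keys):
        for combo in product(*(table[key] for table in tables)):
            yield key, combo


a = [("k1", "a1"), ("k2", "a2")]
b = [("k1", "b1"), ("k1", "b2"), ("k3", "b3")]

merged_output = list(merge_phase(a, b))
joined_output = list(join(a, b))
```

With these inputs, the merge yields every key with one flat list of values and no notion of which dataset a value came from, while the join yields only k1, once per combination of a value from a with a value from b.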


On Jul 3, 2008, at 7:54 AM, Stuart Sierra wrote:

Hello all,

After recent talk about joins, I have a (possibly) stupid question:

What is the difference between the "join" operations in
o.a.h.mapred.join and the standard merge step in a MapReduce job?

I understand that doing a join in the Mapper would be much more
efficient if you're lucky enough to have your input pre-sorted and
-partitioned.

But how is a join operation in the Reducer any different from the
shuffle/sort/merge that the MapReduce framework already does?

Be gentle.  Thanks,
-Stuart
