Hi Stuart,

Join is a higher-level logical operation, while map/reduce is a technique that
can be used to implement it. Specifically, in relational algebra, the join
construct specifies how to form a single output row from two rows arising from
two input streams. There are many ways of implementing this logical operation,
and traditional database systems ship with a number of such implementations.
Map/reduce, being a system that essentially lets you cluster data by doing a
distributed sort, is amenable to the sort-based technique for doing the join. A
particular implementation of the reducer gets a combined stream of data from
the two or more input streams such that they match on the key. It then
proceeds to generate the cartesian product of the rows from the input streams.
In order to implement a join, you need to write this join reducer yourself,
which is what org.apache.hadoop.mapred.join does for you. I hope that clears
up the confusion.
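To make the idea concrete, here is a minimal language-agnostic sketch of that
sort-based reduce-side join. The function and variable names are illustrative,
not the Hadoop API; the grouping step stands in for the framework's
shuffle/sort, and the "reducer" body is the cartesian product described above.

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(left, right):
    """Sketch of a sort-based reduce-side join (not the Hadoop API).

    left and right are iterables of (key, value) pairs from two
    input streams; returns (key, left_value, right_value) tuples.
    """
    # "Shuffle/sort" stand-in: group both streams' rows by key,
    # keeping track of which stream each row came from.
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)

    # "Join reducer": for each key, emit the cartesian product of the
    # matching rows from the two streams. Keys present in only one
    # stream produce nothing (an inner join).
    out = []
    for key in sorted(grouped):
        left_vals, right_vals = grouped[key]
        for lv, rv in product(left_vals, right_vals):
            out.append((key, lv, rv))
    return out

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]
print(reduce_side_join(users, orders))
# [(1, 'alice', 'book'), (1, 'alice', 'pen')]
```

The point of the sketch is that the framework's sort/merge only hands the
reducer the co-grouped rows; producing joined output rows from them is the
extra step the join reducer has to supply.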

Cheers,
Ashish


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Thu 7/3/2008 7:54 AM
To: core-user@hadoop.apache.org
Subject: Difference between joining and reducing
 
Hello all,

After recent talk about joins, I have a (possibly) stupid question:

What is the difference between the "join" operations in
o.a.h.mapred.join and the standard merge step in a MapReduce job?

I understand that doing a join in the Mapper would be much more
efficient if you're lucky enough to have your input pre-sorted and
-partitioned.

But how is a join operation in the Reducer any different from the
shuffle/sort/merge that the MapReduce framework already does?

Be gentle.  Thanks,
-Stuart
