Hi Pankil,

Basically there are two steps here. The first is to sort the two files.
This can be done with a MapReduce job where the mapper extracts the join
column as the key.

If you make sure you have the same number of reducers (and partition by the
equijoin column) for both sorts, then you'll end up with:

A        B
part-0  part-0
part-1  part-1

etc

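Each part file will be sorted on the join column, and because both sorts
were partitioned the same way, part-N of A holds the same join keys as
part-N of B. That means you can merge each pair of part files independently.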
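
To make the sort step concrete, here is a rough sketch in plain Python
that simulates it outside of Hadoop. It is only an illustration: the
function names, the CRC32-based partition function, and the tuple record
layout are my own assumptions, not anything Hadoop provides. The point it
shows is that using the same partition function and the same number of
reducers for both inputs puts any given join key in the same part number
on both sides.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Deterministic stand-in for a partitioner (hypothetical, not Hadoop's)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sort_into_parts(records, key_index, num_reducers):
    """Simulate one sort job: returns a list of parts, each sorted by join key.

    records   -- iterable of tuples
    key_index -- position of the join column inside each tuple
    """
    parts = [[] for _ in range(num_reducers)]
    for rec in records:
        key = rec[key_index]                      # "mapper": extract join key
        parts[partition(key, num_reducers)].append((key, rec))
    for part in parts:                            # "reducer": sorted by key
        part.sort(key=lambda kv: kv[0])
    return parts


if __name__ == "__main__":
    a = [("k3", "a3"), ("k1", "a1"), ("k2", "a2")]
    b = [("k2", "b2"), ("k1", "b1"), ("k4", "b4")]
    for i, (pa, pb) in enumerate(zip(sort_into_parts(a, 0, 2),
                                     sort_into_parts(b, 0, 2))):
        print("part-%d" % i, [k for k, _ in pa], [k for k, _ in pb])
```

In a real job the framework does the partitioning and sorting for you; the
only thing you have to arrange is that both jobs agree on the partitioner
and the reducer count.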

To do the merge, you can just pick either A or B as your input for locality
hints, and then, in the mapper, given the file name, determine the filename
of the other partition. Open that up as a side input in your mapper and
perform the merge like you would in a non-distributed setting.
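
For what it's worth, the per-partition merge could look roughly like the
sketch below. Again this is plain Python rather than Hadoop code, and the
helper names (`sibling_part`, `merge_join`), the directory layout, and the
(key, value) record shape are all assumptions made up for illustration.
It handles duplicate keys by emitting every left/right combination for a
matching key, which is the usual inner equijoin behaviour.

```python
import posixpath
from itertools import groupby
from operator import itemgetter


def sibling_part(path_in_a: str, dir_b: str) -> str:
    """Given e.g. /sorted/A/part-0, return the matching file under dir_b."""
    return posixpath.join(dir_b, posixpath.basename(path_in_a))


def merge_join(left, right):
    """Inner-join two iterables of (key, value) that are sorted by key.

    Yields (key, left_value, right_value) for every matching pair.
    """
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:                      # left key has no partner yet
            lk, lg = next(lgroups, (None, None))
        elif lk > rk:                    # right key has no partner yet
            rk, rg = next(rgroups, (None, None))
        else:                            # keys match: emit the cross product
            rvals = [v for _, v in rg]   # buffer one side of this key only
            for _, lv in lg:
                for rv in rvals:
                    yield (lk, lv, rv)
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))


if __name__ == "__main__":
    print(sibling_part("/sorted/A/part-0", "/sorted/B"))
    part_a = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
    part_b = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
    for row in merge_join(part_a, part_b):
        print(row)
```

In the mapper, the left side would be the records from the input split and
the right side would be the stream you opened from the other partition's
file. Note the sketch only buffers the right-hand values for one key at a
time, so memory use depends on the largest group of duplicates, not on the
file size.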

Hope this helps
-Todd


On Thu, Jul 9, 2009 at 9:09 AM, Pankil Doshi <forpan...@gmail.com> wrote:

> Hi,
>
> Does anyone have a hint on how to implement "SORT-MERGE JOIN" using the
> map-reduce paradigm?
> I read the article about it on the Pig wiki but it was not clear to me,
> as it doesn't show it in the form of map and reduce.
>
> Pankil
>
