This is particularly useful if your input is the output of another MR job;
otherwise the extra pass to sort and partition the inputs is a killer.
You may want to write your own mapper if one of the files to be joined is
small enough to fit in memory or can be handled in splits.
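
To illustrate the "small file fits in memory" case, here is a minimal sketch in
plain Python rather than Hadoop code. The function names and the sample records
are hypothetical, and it assumes each join key appears at most once in the small
file. The small side is loaded into a dictionary once, and the large side is
then streamed through the mapper record by record.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (join key, value); an assumed record shape


def load_small_side(records: Iterable[Record]) -> Dict[str, str]:
    """Load the small input fully into memory, keyed by join key.

    Assumption: keys are unique in the small input.
    """
    return dict(records)


def join_mapper(
    small: Dict[str, str], big_split: Iterable[Record]
) -> Iterator[Tuple[str, str, str]]:
    """Stream one split of the large input and probe the in-memory table."""
    for key, value in big_split:
        if key in small:
            # Inner join: emit only keys present on both sides.
            yield key, small[key], value


# Hypothetical sample data, only to show the flow.
small = load_small_side([("a", "1"), ("b", "2")])
joined = list(join_mapper(small, [("b", "x"), ("c", "y"), ("a", "z")]))
```

The large input does not need to be sorted or partitioned for this variant,
since every lookup goes against the in-memory table.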

Thanks,
Amogh

-----Original Message-----
From: Jason Venner [mailto:jason.had...@gmail.com] 
Sent: Thursday, July 30, 2009 8:20 PM
To: common-user@hadoop.apache.org
Subject: Re: map side join

The map-side join code builds multiple map tasks; each map task receives as
input one partition from each of your input sources.

In your case, your job would have 3 map tasks, and each map task would
receive data from 1 partition in each source file.
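
As a rough sketch of that pairing (plain Python, not the Hadoop implementation;
the function name and the part file names are made up for illustration), map
task i gets partition i from every source, which only works if all sources have
the same number of partitions:

```python
from typing import Dict, List


def build_map_task_inputs(sources: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Pair up the i-th partition of every source into one map task input."""
    counts = {len(parts) for parts in sources.values()}
    if len(counts) != 1:
        raise ValueError("all sources must have the same number of partitions")
    n = counts.pop()
    return [{name: parts[i] for name, parts in sources.items()} for i in range(n)]


# Hypothetical layout: 2 sources with 3 partitions each, as in the question.
sources = {
    "A": ["A/part-0", "A/part-1", "A/part-2"],
    "B": ["B/part-0", "B/part-1", "B/part-2"],
}
tasks = build_map_task_inputs(sources)
```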

The map-side join code keeps a reader open for each input file in the
input split and produces key/value sets via a streaming merge of these
sorted input files.
The merge is essentially done key by key before the key/value set is
presented to the map.

Implicit in the mapside join is that the input files are already sorted, so
the join code only has to figure out which key is next out of the set of
input files in the task.
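
The streaming merge described above can be sketched as follows. This is plain
Python for illustration, not the Hadoop join code; the function names and sample
records are hypothetical, and it assumes an inner join over inputs that are
already sorted by key. The merge holds one pending record per input, plus the
values for the key currently being joined, so a whole split never has to sit in
memory.

```python
import heapq
from itertools import groupby, product
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (key, value); an assumed record shape


def tag(source_index: int, records: Iterable[Record]) -> Iterator[Tuple[str, int, str]]:
    """Attach the source index so values can be separated again after merging."""
    for key, value in records:
        yield key, source_index, value


def merge_join(*sorted_sources: Iterable[Record]) -> Iterator[Tuple[str, Tuple[str, ...]]]:
    """Inner join of inputs that are each already sorted by key."""
    tagged = [tag(i, src) for i, src in enumerate(sorted_sources)]
    # heapq.merge reads lazily, always picking the smallest next key.
    merged = heapq.merge(*tagged, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        per_source = [[] for _ in sorted_sources]
        for _, index, value in group:
            per_source[index].append(value)
        # Emit only if every source contributed a value for this key.
        if all(per_source):
            for combo in product(*per_source):
                yield key, combo


# Hypothetical sorted partitions from two sources.
a = [("k1", "a1"), ("k2", "a2"), ("k4", "a4")]
b = [("k2", "b2"), ("k3", "b3"), ("k4", "b4"), ("k4", "b4x")]
result = list(merge_join(a, b))
```

If the inputs were not sorted, the merge would miss matches, which is why the
sort requirement is implicit in this approach.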

On Wed, Jul 29, 2009 at 8:48 AM, bonito <bonito.pe...@gmail.com> wrote:

>
> Hello,
> I would like to ask a question regarding the map side join. I am trying to
> understand the implementation of it and I would be
> grateful if you could tell me whether there is any I/O cost included.
> In detail,
> if we have 2 source files of 3 splits each (so as to ensure the constraints
> that is, sorted, partitioned etc.) then during map side join these 2 files
> are merged before the map function takes place.
> I am trying to comprehend how this merge is done. If I am not mistaken,
> each
> pair of corresponding splits is merged at a time. That is, first the
> splits(1) of both sources are taken into account.
>
> How? Is this done in a 'on the fly' fashion  (in-memory buffer)? Is there
> any file locally created?
>
> I read the relevant details about the iterators but I wonder about the
> memory requirements... If each split need to be in-memory stored so as to
> have an iterator over it, then there should be a requirement of memory
> space.
>
> Thank you!
>
>
> --
> View this message in context:
> http://www.nabble.com/map-side-join-tp24722077p24722077.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
