Hi,

You could check out the DistributedCache (
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
It lets you distribute read-only side data to the nodes where your tasks run.
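Roughly, the flow is: register the file in the job driver, then read the
local copy from the task. A minimal sketch against the old mapred API
(Hadoop 1.x); the class name, paths and key/value types below are just
placeholders, not your actual job:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SideDataJoin {

  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Path sideData;

    @Override
    public void configure(JobConf job) {
      try {
        // Files registered with addCacheFile() are copied to local disk
        // on each node before the first task runs there.
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        sideData = cached[0];
      } catch (IOException e) {
        throw new RuntimeException("Cache file not available", e);
      }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... join this record of A against the local copy of B ...
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SideDataJoin.class);
    conf.setJobName("side-data-join");
    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0);           // map-side join, no reduce phase
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // dataset A
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    // Ship dataset B (args[1], an HDFS path) to every tasktracker.
    DistributedCache.addCacheFile(new URI(args[1]), conf);

    JobClient.runJob(conf);
  }
}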

Thanks
Hemanth

On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <
sigurd.spieckerm...@gmail.com> wrote:

> Hi,
>
> I would like to perform a map-side join of two large datasets, where
> dataset A consists of m*n elements and dataset B consists of n elements.
> For the join, every element of dataset B needs to be accessed m times, and
> each mapper would join one element from A with the corresponding element
> from B. Elements here are actually data blocks. Is there a performance
> problem (or a difference compared to a slightly modified map-side join
> using the join package) if I set dataset A as the map-reduce input and
> load the relevant element of dataset B directly from HDFS inside the
> mapper? I could store the elements of B in a MapFile for faster random
> access. In the second case, without the join package, I would not have to
> partition the datasets manually, which would give me a bit more
> flexibility, but I'm wondering whether HDFS access from inside a mapper is
> generally a bad idea. Also, does Hadoop have a cache for such situations
> by any chance?
>
> I appreciate any comments!
>
> Sigurd
>
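P.S. On the MapFile idea in your message: the reader loads the MapFile
index into memory, so after an initial open each lookup costs roughly one
seek plus a short scan in the data file. A quick sketch of the lookup path
(the path and key/value types below are placeholders); in a real job you
would open the reader once per task, e.g. in configure(), and reuse it for
all records:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Opening the reader pulls the MapFile index into memory.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "/user/sigurd/datasetB.map", conf);
    try {
      Text value = new Text();
      // One lookup = binary search in the in-memory index, then one
      // seek and a short scan in the data file.
      if (reader.get(new Text(args[0]), value) != null) {
        System.out.println("B[" + args[0] + "] = " + value);
      }
    } finally {
      reader.close();
    }
  }
}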
