Hi,

You could check DistributedCache
(http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
It would allow you to distribute data to the nodes where your tasks are run.
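For example, a rough sketch with the old mapred API from that tutorial
(untested, and the /user/me/... path is just a placeholder):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {

  // Driver side: register dataset B so the framework copies it to
  // every task node before the mappers start. Placeholder path.
  public static void shipDatasetB(JobConf conf) throws URISyntaxException {
    DistributedCache.addCacheFile(
        new URI("/user/me/datasetB/part-00000"), conf);
  }

  // Task side (e.g. in Mapper.configure()): locate the local copy and
  // read it with ordinary local I/O instead of going back to HDFS.
  public static Path localDatasetB(JobConf conf) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    return (cached == null || cached.length == 0) ? null : cached[0];
  }
}

Since the file lands on local disk once per node, your mappers would not
pay a repeated HDFS read for every record of A.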
Thanks
Hemanth

On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

> Hi,
>
> I would like to perform a map-side join of two large datasets, where
> dataset A consists of m*n elements and dataset B consists of n elements.
> For the join, every element in dataset B needs to be accessed m times.
> Each mapper would join one element from A with the corresponding element
> from B. Elements here are actually data blocks. Is there a performance
> problem (or a difference compared to a slightly modified map-side join
> using the join package) if I set dataset A as the map-reduce input and
> load the relevant element from dataset B directly from HDFS inside the
> mapper? I could store the elements of B in a MapFile for faster random
> access. In the second case, without the join package, I would not have
> to partition the datasets manually, which would allow a bit more
> flexibility, but I'm wondering if HDFS access from inside a mapper is
> strictly bad. Also, does Hadoop have a cache for such situations, by any
> chance?
>
> I appreciate any comments!
>
> Sigurd
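P.S.: in case it helps, the MapFile lookup you describe could look roughly
like this. This is an untested sketch; the path, key/value types and class
name are all placeholders:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Joins each record of dataset A (the map-reduce input) with the
// matching element of dataset B, looked up in a MapFile on HDFS.
public class JoinMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private MapFile.Reader reader;
  private Text bValue = new Text();

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // Placeholder path to the MapFile directory holding dataset B.
      // Opened once per task, so the HDFS open cost is paid only once.
      reader = new MapFile.Reader(fs, "/user/me/datasetB", job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void map(Text key, Text aValue,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Random access into dataset B by the join key.
    if (reader.get(key, bValue) != null) {
      out.collect(key, new Text(aValue + "\t" + bValue));
    }
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }
}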