You need to use broadcast followed by flatMap or mapPartitions to do map-side
joins: in your map function, you can look up each record's key in the hash
table you broadcast and emit the matching records. Spark SQL also does this
automatically for tables smaller than the spark.sql.autoBroadcastJoinThreshold
setting (10 KB by default, which is really small, but you can bump this up
with SET spark.sql.autoBroadcastJoinThreshold=1000000, for example).
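
For example, here is a minimal sketch of that pattern in Scala (the RDD names
and sample data are made up for illustration, and it assumes an existing
SparkContext sc):

  import org.apache.spark.SparkContext._   // pair RDD functions (pre-1.3)
  import org.apache.spark.rdd.RDD

  val small: RDD[(Int, String)] = sc.parallelize(Seq((1, "a"), (2, "b")))
  val large: RDD[(Int, Double)] =
    sc.parallelize(Seq((1, 1.0), (2, 2.0), (3, 3.0)))

  // Collect the small side to the driver and broadcast it as a hash map.
  val smallMap = sc.broadcast(small.collectAsMap())

  // Map-side join: each partition probes the broadcast hash map, so the
  // large RDD is never shuffled. Keys missing from the small side are
  // dropped, i.e. this behaves like an inner join.
  val joined = large.mapPartitions { iter =>
    iter.flatMap { case (k, v) =>
      smallMap.value.get(k).map(s => (k, (v, s)))
    }
  }

On the Spark SQL side, you should also be able to change the threshold
programmatically, which is equivalent to the SET statement above:

  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "1000000")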

Matei

> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:
> 
> Hi All,
> 
> I have spent the last two years on Hadoop but am new to Spark.
> I am planning to move one of my existing systems to Spark to get some 
> enhanced features.
> 
> My question is:
> 
> If I try to do a map-side join (something similar to the "replicated" 
> keyword in Pig), how can I do it? Is there any way to declare an RDD as 
> "replicated" (meaning it is distributed to all nodes and each node has a 
> full copy)?
> 
> I know I can use an accumulator to get this feature, but I am not sure 
> what the best practice is. And if I use an accumulator to broadcast the 
> data set, can I then (after broadcasting) convert it into an RDD and do 
> the join?
> 
> Regards,
> 
> Shuai

