Re: Broadcast joins on RDD

2015-01-12 Thread Reza Zadeh
First, you should collect().toMap() the small RDD, then you should use broadcast followed by a map to do a map-side join http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf (slide 10 has an example). Spark SQL also does it by default for tables

Broadcast joins on RDD

2015-01-12 Thread Pala M Muthaia
Hi, How do i do broadcast/map join on RDDs? I have a large RDD that i want to inner join with a small RDD. Instead of having the large RDD repartitioned and shuffled for join, i would rather send a copy of a small RDD to each task, and then perform the join locally. How would i specify this in