First, collect the small RDD to the driver as a map (collect().toMap),
then broadcast that map and use a map over the large RDD to do a
map-side join. Slide 10 of this deck has an example:
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
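
A minimal sketch of that approach in Scala, in case it helps. The object
name MapSideJoin, the helper broadcastJoin, and the sample data are made
up for illustration and are not taken from the slides. It assumes the
small RDD fits in memory on the driver and on every executor, and that
its keys are unique (toMap keeps only one value per key).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapSideJoin {
  // Hypothetical helper: inner-joins `large` with `small` without
  // shuffling `large`. Assumes `small` is small enough to collect.
  def broadcastJoin(sc: SparkContext,
                    large: RDD[(Int, String)],
                    small: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
    // Pull the small side to the driver and ship one copy per executor.
    val smallMap = sc.broadcast(small.collect().toMap)
    // Look each key up locally; keys missing from the small side are
    // dropped, which gives inner-join semantics.
    large.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-side-join").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val large = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 1 -> "d"))
      val small = sc.parallelize(Seq(1 -> "x", 3 -> "z"))
      broadcastJoin(sc, large, small).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data, key 2 has no match in the small RDD and is
dropped, while both records with key 1 are paired with "x". If the small
RDD can hold several values per key, you would need to group the values
per key before broadcasting instead of calling toMap directly.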
Spark SQL also does it by default for tables
Hi,
How do I do a broadcast/map join on RDDs? I have a large RDD that I want to
inner join with a small RDD. Instead of having the large RDD repartitioned
and shuffled for the join, I would rather send a copy of the small RDD to
each task and then perform the join locally.
How would I specify this in