First, you should collect().toMap() the small RDD, then you should use
broadcast followed by a map to do a map-side join
<http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf>
(slide
10 has an example).

Spark SQL also does it by default for tables that are smaller than the
spark.sql.autoBroadcastJoinThreshold setting (by default 10 KB, which is
really small, but you can bump this up with set
spark.sql.autoBroadcastJoinThreshold=1000000 for example).

On Mon, Jan 12, 2015 at 3:15 PM, Pala M Muthaia <mchett...@rocketfuelinc.com
> wrote:

> Hi,
>
>
> How do i do broadcast/map join on RDDs? I have a large RDD that i want to
> inner join with a small RDD. Instead of having the large RDD repartitioned
> and shuffled for join, i would rather send a copy of a small RDD to each
> task, and then perform the join locally.
>
> How would i specify this in Spark code? I didn't find much documentation
> online. I attempted to create a broadcast variable out of the small RDD and
> then access that in the join operator:
>
> largeRdd.join(smallRddBroadCastVar.value)
>
> but that didn't work as expected ( I found that all rows with same key
> were on same task)
>
> I am using Spark version 1.0.1
>
>
> Thanks,
> pala
>
>
>

Reply via email to