There is a way, and it's called a "map-side join". To be clear, there is no
explicit function call/API in Spark that executes a map-side join. You have
to code it yourself by combining a local/broadcast value with the map()
function. The caveat is that one side of the join must be small enough to
fit in memory as a local/broadcast value. What you end up with is a
partition-local join performed inside the map function: the result is
equivalent to a regular join, but it avoids a cluster-wide shuffle.
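
Here's a rough, untested Scala sketch of the idea (smallRdd/largeRdd are
just placeholder names; it assumes the small side fits comfortably in
memory):

import org.apache.spark.{SparkConf, SparkContext}

object MapSideJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-side-join-sketch"))

    // Placeholder data: the small side must fit in driver/executor memory.
    val smallRdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))      // small lookup side
    val largeRdd = sc.parallelize(Seq((1, 10.0), (2, 20.0), (4, 40.0)))   // large side

    // Collect the small side to the driver and broadcast it to every executor.
    val smallMap = sc.broadcast(smallRdd.collectAsMap())

    // "Join" locally inside each partition of the large RDD: largeRdd is never shuffled.
    val joined = largeRdd.mapPartitions { iter =>
      val lookup = smallMap.value
      iter.flatMap { case (k, v) =>
        lookup.get(k).map(s => (k, (v, s)))   // inner-join semantics: unmatched keys are dropped
      }
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}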

Read the PDF below and look for "Example: Join". It explains how joins work
in Spark and how you can optimize them with a map-side join (if your use
case fits):
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
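
And for completeness, the pre-partitioning approach you mention below would
look roughly like this (again only a sketch; as you note, the initial
partitionBy itself shuffles that RDD once, so it mostly pays off when the
same RDD is joined more than once):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("co-partitioned-join-sketch"))

    // Placeholder data standing in for the stored and incoming RDDs.
    val storedRdd   = sc.parallelize(Seq((1, "stored-a"), (2, "stored-b")))
    val incomingRdd = sc.parallelize(Seq((1, 100.0), (2, 200.0), (3, 300.0)))

    // Partition the large side once and cache it; this shuffle happens a single time.
    val partitioner = new HashPartitioner(sc.defaultParallelism)
    val partitionedStored = storedRdd.partitionBy(partitioner).cache()

    // Because partitionedStored already has a known partitioner, the join
    // only shuffles incomingRdd over to the matching partitions.
    val joined = partitionedStored.join(incomingRdd)

    joined.collect().foreach(println)
    sc.stop()
  }
}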

HTH,
Duc



On Wed, Dec 16, 2015 at 3:23 AM, sparkuser2345 <hm.spark.u...@gmail.com>
wrote:

> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MongoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
>
>
