There is a way, and it's called a "map-side join". To be clear, there is no explicit function call/API in Spark that executes one: you code it yourself by combining a local/broadcast value with the map() function. The caveat is that one side of the join must be small enough to fit in memory as a broadcast value on every executor. What you're achieving is a partition-local join via the map function; the result is equivalent to a join, but it avoids a cluster-wide shuffle.
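Stripped of Spark specifics, the core idea is a per-record hash lookup against an in-memory copy of the small side. A minimal sketch in plain Python (the names and sample data are made up for illustration):

```python
# Map-side join sketch: the small side becomes an in-memory dict
# (standing in for a Spark broadcast variable); the large side is
# streamed through and joined via a local hash lookup -- no shuffle.

small_side = {1: "alice", 2: "bob"}  # small-ish table, broadcastable
large_side = [(1, 100.0), (2, 250.0), (1, 17.5), (3, 9.9)]  # (key, value)

def map_side_join(records, lookup):
    # Equivalent to records.map(...) applied within one partition:
    # emit (key, (value, looked-up value)) only for keys present in
    # the broadcast table, i.e. an inner join done entirely map-side.
    return [(k, (v, lookup[k])) for k, v in records if k in lookup]

joined = map_side_join(large_side, small_side)
print(joined)  # [(1, (100.0, 'alice')), (2, (250.0, 'bob')), (1, (17.5, 'alice'))]
```

In Spark itself you would wrap the small side with sc.broadcast(small_side) and reference its .value inside the closure you pass to map() or mapPartitions(), so each executor joins against its local copy.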
Read the PDF below and look for "Example: Join". It explains how joins work in Spark and how you can optimize one with a map-side join (if your use case fits).

http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

HTH,
Duc

On Wed, Dec 16, 2015 at 3:23 AM, sparkuser2345 <hm.spark.u...@gmail.com> wrote:

> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MongoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand, but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.