A join needs a partitioner and will shuffle the data as required by that partitioner (if the data is already partitioned that way, it is left alone), after which it will process with something like a map-side join.
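To make the map-side-join point concrete, here is a minimal plain-Python model (no Spark API) of why two datasets partitioned by the *same* partitioner can be joined with no further data movement: every key's records from both sides already sit in the same partition, so the join runs partition-locally. The datasets and keys below are made up for illustration.

```python
NUM_PARTITIONS = 4

def partitioner(key):
    # One deterministic partitioner shared by both sides; Spark's
    # HashPartitioner works analogously (non-negative hash mod #partitions).
    return hash(key) % NUM_PARTITIONS

def partition(records):
    # Bucket (key, value) pairs into partitions by the shared partitioner.
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partitioner(key)].append((key, value))
    return parts

def local_join(part_a, part_b):
    # Join within a single partition: build a small hash table on one side.
    table = {}
    for k, v in part_a:
        table.setdefault(k, []).append(v)
    return [(k, (v1, v2)) for k, v2 in part_b for v1 in table.get(k, [])]

# Hypothetical data: one side standing in for the Mongo RDD, one for the
# incoming-data RDD, both keyed the same way.
mongo_side = [("u1", "doc1"), ("u2", "doc2"), ("u3", "doc3")]
incoming   = [("u1", "evt_a"), ("u3", "evt_b")]

parts_a, parts_b = partition(mongo_side), partition(incoming)
# No records cross partitions here; each partition is joined independently.
joined = [pair
          for i in range(NUM_PARTITIONS)
          for pair in local_join(parts_a[i], parts_b[i])]
```

The key property is that `partition` is applied identically to both sides, so matching keys are guaranteed to be co-located before `local_join` ever runs.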
If you can specify a partitioner that matches the exact layout of your data in Mongo, then the shuffle of the Mongo data should be harmless (it will still be performed, though; I don't think there is yet a way to tell Spark to trust that the data is already partitioned), and only the other dataset would actually be moved to match your Mongo layout.

On Wed, Dec 16, 2015 at 5:23 AM, sparkuser2345 <hm.spark.u...@gmail.com> wrote:
> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MongoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand, but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
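The "partitioner that matches the Mongo layout" idea can be sketched independently of Spark as a function from shard key to partition index. The shard boundaries below are hypothetical; a real implementation would read them from the cluster's shard metadata, and in Spark this logic would live in a custom subclass of `org.apache.spark.Partitioner`.

```python
import bisect

# Hypothetical upper bounds of a ranged MongoDB shard key:
#   partition 0: keys <= "g";  partition 1: "g" < keys <= "p";  partition 2: rest.
# These values are made up for illustration only.
SHARD_UPPER_BOUNDS = ["g", "p"]

def num_partitions():
    # One partition per shard range, plus one for keys above the last bound.
    return len(SHARD_UPPER_BOUNDS) + 1

def get_partition(shard_key):
    # The number of boundaries strictly below (or equal to) the key gives
    # the partition index; bisect keeps this O(log n) in the boundary count.
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, shard_key)
```

Repartitioning only the incoming-data RDD with a partitioner like this would line its partitions up with the Mongo shard layout, which is what lets the join leave the Mongo-side data where it is (modulo the no-op shuffle discussed above).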