Is there a way to prevent an RDD from shuffling in a join operation without repartitioning it?
I'm reading an RDD from sharded MongoDB, joining it with an RDD of incoming data (plus some additional calculations), and writing the resulting RDD back to MongoDB. It would make sense to shuffle only the incoming-data RDD, so that the joined RDD would already be partitioned correctly according to the MongoDB shard key.

I know I can prevent an RDD from shuffling in a join by partitioning it beforehand, but partitioning would itself shuffle the RDD. In addition, I'm only doing the join once per RDD read from MongoDB. Is there a way to tell Spark to shuffle only the incoming-data RDD?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
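For concreteness, here is a plain-Python sketch (not the Spark API, and all names are made up for illustration) of the layout I'd like the join to exploit: if both sides are hash-partitioned by the same key function into the same number of partitions, matching keys already sit in the same partition, so only the incoming side would need to be moved.

```python
# Conceptual sketch of co-partitioning, in plain Python rather than Spark.
# The point: once the big dataset is laid out by hash(key) % n (as the
# MongoDB shards effectively already are), only the incoming data needs
# to be redistributed before a partition-local join.

NUM_PARTITIONS = 4

def partition_by_key(records, n):
    """Hash-partition (key, value) pairs into n buckets."""
    parts = [[] for _ in range(n)]
    for k, v in records:
        parts[hash(k) % n].append((k, v))
    return parts

# Large dataset, already partitioned (standing in for the MongoDB RDD).
big = partition_by_key([(1, "a"), (2, "b"), (3, "c")], NUM_PARTITIONS)

# Incoming data: only this side gets "shuffled" into the same layout.
incoming = partition_by_key([(1, "x"), (3, "y")], NUM_PARTITIONS)

def join(parts_a, parts_b):
    """Partition-local join: no data crosses partition boundaries."""
    out = []
    for pa, pb in zip(parts_a, parts_b):
        lookup = dict(pb)
        out.extend((k, (v, lookup[k])) for k, v in pa if k in lookup)
    return sorted(out)

print(join(big, incoming))
```

In Spark terms this corresponds to joining against an RDD that already carries a partitioner; the open question above is how to declare that the MongoDB-backed RDD is already laid out this way without paying for a `partitionBy` shuffle first.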