A join needs a partitioner and will shuffle the data as required by that partitioner (if the data is already partitioned that way, it is left alone), after which it will process with something like a map-side join.
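To make the map-side-join point concrete, here is a minimal plain-Python model (no Spark API) of why two datasets partitioned by the *same* partitioner can be joined with no further data movement: every key's records from both sides already sit in the same partition, so the join runs partition-locally. The datasets and keys below are made up for illustration.

```python
NUM_PARTITIONS = 4

def partitioner(key):
    # One deterministic partitioner shared by both sides; Spark's
    # HashPartitioner works analogously (non-negative hash mod #partitions).
    return hash(key) % NUM_PARTITIONS

def partition(records):
    # Bucket (key, value) pairs into partitions by the shared partitioner.
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partitioner(key)].append((key, value))
    return parts

def local_join(part_a, part_b):
    # Join within a single partition: build a small hash table on one side.
    table = {}
    for k, v in part_a:
        table.setdefault(k, []).append(v)
    return [(k, (v1, v2)) for k, v2 in part_b for v1 in table.get(k, [])]

# Hypothetical data: one side standing in for the Mongo RDD, one for the
# incoming-data RDD, both keyed the same way.
mongo_side = [("u1", "doc1"), ("u2", "doc2"), ("u3", "doc3")]
incoming   = [("u1", "evt_a"), ("u3", "evt_b")]

parts_a, parts_b = partition(mongo_side), partition(incoming)
# No records cross partitions here; each partition is joined independently.
joined = [pair
          for i in range(NUM_PARTITIONS)
          for pair in local_join(parts_a[i], parts_b[i])]
```

The key property is that `partition` is applied identically to both sides, so matching keys are guaranteed to be co-located before `local_join` ever runs.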
If you can specify a partitioner that matches the exact layout of your data in Mongo, then the shuffle of the Mongo data should be harmless (it will still be performed, though; I don't think there is yet a way to tell Spark to trust that the data is already partitioned), and only the other dataset would actually be moved to match your Mongo layout.

On Wed, Dec 16, 2015 at 5:23 AM, sparkuser2345 <hm.spark.u...@gmail.com> wrote:
> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MongoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand, but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
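The "partitioner that matches the Mongo layout" idea can be sketched independently of Spark as a function from shard key to partition index. The shard boundaries below are hypothetical; a real implementation would read them from the cluster's shard metadata, and in Spark this logic would live in a custom subclass of `org.apache.spark.Partitioner`.

```python
import bisect

# Hypothetical upper bounds of a ranged MongoDB shard key:
#   partition 0: keys <= "g";  partition 1: "g" < keys <= "p";  partition 2: rest.
# These values are made up for illustration only.
SHARD_UPPER_BOUNDS = ["g", "p"]

def num_partitions():
    # One partition per shard range, plus one for keys above the last bound.
    return len(SHARD_UPPER_BOUNDS) + 1

def get_partition(shard_key):
    # The number of boundaries strictly below (or equal to) the key gives
    # the partition index; bisect keeps this O(log n) in the boundary count.
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, shard_key)
```

Repartitioning only the incoming-data RDD with a partitioner like this would line its partitions up with the Mongo shard layout, which is what lets the join leave the Mongo-side data where it is (modulo the no-op shuffle discussed above).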