Please file a JIRA for it.

On Mon, Jun 15, 2015 at 8:00 AM, mrm <ma...@skimlinks.com> wrote:
> Hi all,
>
> I was looking for an explanation of the number of partitions of a joined
> RDD.
>
> The documentation of Spark 1.3.1 says:
> "For distributed shuffle operations like reduceByKey and join, the largest
> number of partitions in a parent RDD."
> https://spark.apache.org/docs/latest/configuration.html
>
> And the Partitioner.scala comments (line 51) state:
> "Unless spark.default.parallelism is set, the number of partitions will be
> the same as the number of partitions in the largest upstream RDD, as this
> should be least likely to cause out-of-memory errors."
>
> But this is misleading for the Python API, where if you do rddA.join(rddB),
> the output number of partitions is the number of partitions of A plus the
> number of partitions of B!
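
For anyone who wants to reproduce the report above, here is a minimal sketch
(assuming a local SparkContext with spark.default.parallelism unset; the data
and partition counts are made up purely for illustration) that checks the
partition count of a joined RDD:

    from pyspark import SparkContext

    # Hypothetical local context, just for demonstration.
    sc = SparkContext("local[2]", "join-partition-count")

    rddA = sc.parallelize([(1, "a"), (2, "b")], 4)  # 4 partitions
    rddB = sc.parallelize([(1, "x"), (3, "y")], 3)  # 3 partitions

    joined = rddA.join(rddB)

    # The configuration docs suggest 4 here (the largest parent RDD);
    # the report above observes 4 + 3 = 7 from the Python API instead.
    print(joined.getNumPartitions())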