Hi all,

I was looking for an explanation of the number of partitions of a joined
RDD.

The Spark 1.3.1 documentation says (on the configuration page, under
spark.default.parallelism):
"For distributed shuffle operations like reduceByKey and join, the largest
number of partitions in a parent RDD."
https://spark.apache.org/docs/latest/configuration.html

And the comment in Partitioner.scala (line 51) states:
"Unless spark.default.parallelism is set, the number of partitions will be
the same as the number of partitions in the largest upstream RDD, as this
should be least likely to cause out-of-memory errors."

But this is misleading for the Python API: if you do rddA.join(rddB), the
resulting RDD has the number of partitions of rddA plus the number of
partitions of rddB, not the largest of the two.
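
For example, here is a minimal sketch of what I am seeing (the partition
counts 3 and 5 and the toy data are arbitrary; this assumes
spark.default.parallelism is not set):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "join-partition-count")

    # Two pair RDDs with different partition counts (3 and 5).
    rddA = sc.parallelize([(k, "a") for k in range(100)], 3)
    rddB = sc.parallelize([(k, "b") for k in range(100)], 5)

    joined = rddA.join(rddB)

    # Per the documentation I would expect max(3, 5) = 5 partitions,
    # but the joined RDD reports 3 + 5 = 8.
    print(rddA.getNumPartitions())     # 3
    print(rddB.getNumPartitions())     # 5
    print(joined.getNumPartitions())   # 8, not 5

If I am reading python/pyspark/join.py correctly, the Python join is
implemented as a union of the two RDDs followed by a groupByKey, and with no
explicit numPartitions the groupByKey defaults to the partition count of its
parent (the union), which would explain the sum rather than the max, but I
may be missing something.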


