Please file a JIRA for it.
On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote:
Hi all,
I was looking for an explanation on the number of partitions for a joined
rdd.
The documentation of Spark 1.3.1 says that:
For distributed shuffle operations like reduceByKey and join, the largest
number of partitions in a parent RDD.
https://spark.apache.org/docs/latest/configuration.html
And the Partitioner.scala comments (line 51) state that:
Unless spark.default.parallelism is set, the number of partitions will be
the same as the number of partitions in the largest upstream RDD, as this
should be least likely to cause out-of-memory errors.
But this is misleading for the Python API: if you do rddA.join(rddB), the
number of partitions of the output is the number of partitions of A plus
the number of partitions of B!
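To make the discrepancy concrete, here is a plain-Python illustration (not Spark code) of the two default rules being contrasted: the documented Scala rule (largest parent partition count) versus the behaviour reported above for PySpark (sum of the parents' partition counts):

```python
# Toy illustration of the two default-partition rules discussed above.
# These are NOT Spark APIs, just arithmetic showing the difference.

def documented_default(parent_partition_counts):
    # Partitioner.scala rule: same as the largest upstream RDD
    return max(parent_partition_counts)

def observed_pyspark_default(parent_partition_counts):
    # Behaviour reported here for rddA.join(rddB) in the Python API
    return sum(parent_partition_counts)

# e.g. rddA has 4 partitions, rddB has 6
print(documented_default([4, 6]))        # 6  (what the docs promise)
print(observed_pyspark_default([4, 6]))  # 10 (what PySpark produces)
```

So a join of a 4-partition RDD with a 6-partition RDD yields 10 partitions in PySpark, where the documentation would lead you to expect 6.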
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/number-of-partitions-in-join-Spark-documentation-misleading-tp23316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org