Hi all, I was looking for an explanation of how the number of partitions is determined for a joined RDD.
The Spark 1.3.1 documentation says: "For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD." (https://spark.apache.org/docs/latest/configuration.html)

And the comments in Partitioner.scala (line 51) state: "Unless spark.default.parallelism is set, the number of partitions will be the same as the number of partitions in the largest upstream RDD, as this should be least likely to cause out-of-memory errors."

But this is misleading for the Python API: if you do rddA.join(rddB), the number of partitions of the output is the number of partitions of A *plus* the number of partitions of B!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/number-of-partitions-in-join-Spark-documentation-misleading-tp23316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
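For what it's worth, the p_A + p_B behaviour is consistent with PySpark implementing join() on top of union() + groupByKey(): union() keeps every parent partition, so its partition count is the sum of the parents', and when neither a numPartitions argument nor spark.default.parallelism is set, groupByKey falls back to the partition count of the RDD it runs on (the union). A minimal pure-Python sketch of that arithmetic, assuming this union-based implementation (the function names here are illustrative, not the PySpark API):

```python
def union_partition_count(num_a, num_b):
    # union() concatenates the parents' partition lists, so counts add.
    return num_a + num_b

def python_join_partition_count(num_a, num_b, numPartitions=None):
    # Hypothetical model of PySpark's join(): an explicit numPartitions
    # wins; otherwise groupByKey falls back to the unioned RDD's
    # partition count, i.e. num_a + num_b (not the max of the two).
    if numPartitions is not None:
        return numPartitions
    return union_partition_count(num_a, num_b)

print(python_join_partition_count(4, 6))     # -> 10, not max(4, 6) = 6
print(python_join_partition_count(4, 6, 8))  # -> 8 when given explicitly
```

So the documented "largest number of partitions in a parent RDD" rule describes the Scala-side defaultPartitioner, while the Python join takes this union-based path.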