Please file a JIRA for it.

On Mon, Jun 15, 2015 at 8:00 AM, mrm <ma...@skimlinks.com> wrote:
> Hi all,
>
> I was looking for an explanation of the number of partitions of a joined
> RDD.
>
> The documentation of Spark 1.3.1 says:
> "For distributed shuffle operations like reduceByKey and join, the largest
> number of partitions in a parent RDD."
> https://spark.apache.org/docs/latest/configuration.html
>
> And the Partitioner.scala comments (line 51) state:
> "Unless spark.default.parallelism is set, the number of partitions will be
> the same as the number of partitions in the largest upstream RDD, as this
> should be least likely to cause out-of-memory errors."
>
> But this is misleading for the Python API, where if you do rddA.join(rddB),
> the output number of partitions is the number of partitions of A plus the
> number of partitions of B!
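
For anyone who wants to reproduce the report above, here is a minimal sketch
(assuming a local SparkContext with spark.default.parallelism unset; the data
and partition counts are made up purely for illustration) that checks the
partition count of a joined RDD:

    from pyspark import SparkContext

    # Hypothetical local context, just for demonstration.
    sc = SparkContext("local[2]", "join-partition-count")

    rddA = sc.parallelize([(1, "a"), (2, "b")], 4)  # 4 partitions
    rddB = sc.parallelize([(1, "x"), (3, "y")], 3)  # 3 partitions

    joined = rddA.join(rddB)

    # The configuration docs suggest 4 here (the largest parent RDD);
    # the report above observes 4 + 3 = 7 from the Python API instead.
    print(joined.getNumPartitions())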