GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/1138
SPARK-2203: PySpark defaults to use same num reduce partitions as map side For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. JIRA: https://issues.apache.org/jira/browse/SPARK-2203 You can merge this pull request into a Git repository by running: $ git pull https://github.com/aarondav/spark pyfix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1138.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1138 ---- commit 1bd5751fad08b0b2c69f9a0816b6b20fa06621fe Author: Aaron Davidson <aa...@databricks.com> Date: 2014-06-19T19:43:50Z SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. JIRA: https://issues.apache.org/jira/browse/SPARK-2203 ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---