GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1138

    SPARK-2203: PySpark defaults to use same num reduce partitions as map side

    For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), 
PySpark will always assume that the default parallelism to use for the reduce 
side is ctx.defaultParallelism, which is a constant typically determined by the 
number of cores in cluster.
    
    In contrast, Spark's Partitioner#defaultPartitioner will use the same 
number of reduce partitions as map partitions unless the defaultParallelism 
config is explicitly set. This tends to be a better default in order to avoid 
OOMs, and should also be the behavior of PySpark.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2203

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark pyfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1138
    
----
commit 1bd5751fad08b0b2c69f9a0816b6b20fa06621fe
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-06-19T19:43:50Z

    SPARK-2203: PySpark defaults to use same num reduce partitions as map 
partitions
    
    For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), 
PySpark will always
    assume that the default parallelism to use for the reduce side is 
ctx.defaultParallelism,
    which is a constant typically determined by the number of cores in cluster.
    
    In contrast, Spark's Partitioner#defaultPartitioner will use the same 
number of reduce
    partitions as map partitions unless the defaultParallelism config is 
explicitly set. This
    tends to be a better default in order to avoid OOMs, and should also be the 
behavior of PySpark.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2203

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to