Kostas Sakellis created SPARK-4630:
--------------------------------------

             Summary: Dynamically determine optimal number of partitions
                 Key: SPARK-4630
                 URL: https://issues.apache.org/jira/browse/SPARK-4630
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Kostas Sakellis


Partition sizes play a big part in how fast stages execute during a Spark job.
There is a direct relationship between the size of partitions and the number of
tasks: larger partitions mean fewer tasks. For better performance, Spark has a
sweet spot for how large the partitions executed by each task should be. If
partitions are too small, the user pays a disproportionate cost in scheduling
overhead. If partitions are too large, task execution slows down due to GC
pressure and spilling to disk.

To increase the performance of jobs, users often hand-optimize the number (size)
of partitions that the next stage gets. The factors that come into play are
(see the sketch after this list):
- incoming partition sizes from the previous stage
- number of available executors
- available memory per executor (taking into account spark.shuffle.memoryFraction)
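
A minimal sketch, assuming the three factors above are the only inputs, of how
they might be combined into a target partition count. Every name here is an
illustrative assumption, not an existing Spark API.

{code}
// Hedged sketch: derive a partition count from the factors listed above.
def targetNumPartitions(
    totalInputBytes: Long,          // incoming partition sizes from the previous stage, summed
    numExecutors: Int,              // number of available executors
    executorMemoryBytes: Long,      // heap available to each executor
    shuffleMemoryFraction: Double   // value of spark.shuffle.memoryFraction
  ): Int = {
  // Memory a single executor can devote to shuffle data before spilling to disk.
  val shuffleBudgetPerExecutor = (executorMemoryBytes * shuffleMemoryFraction).toLong
  // Enough partitions that each one fits within that budget...
  val partitionsForMemory =
    math.ceil(totalInputBytes.toDouble / shuffleBudgetPerExecutor).toInt
  // ...but never fewer tasks than executors, so the whole cluster stays busy.
  math.max(partitionsForMemory, numExecutors)
}
{code}

With this sketch, a 100 GB shuffle onto 10 executors with 8 GB heaps and a
memoryFraction of 0.2 would come out to roughly ceil(100 / 1.6) = 63 partitions.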

Spark has access to this data and so should be able to do the partition sizing
automatically for the user. This feature could be turned on or off with a
configuration option.
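
For example, the toggle could be an ordinary boolean setting read from SparkConf.
The key name below is a hypothetical placeholder, not an existing configuration.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical key; the issue only asks that the feature be toggleable.
val autoPartitioningEnabled =
  conf.getBoolean("spark.scheduler.autoPartitionSizing.enabled", false)
{code}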

To make this happen, we propose modifying the DAGScheduler to take partition
sizes into account upon stage completion. Before scheduling the next stage, the
scheduler can examine the sizes of the completed stage's partitions and determine
the appropriate number of tasks to create (a rough sketch follows). Since this
change requires non-trivial modifications to the DAGScheduler, a detailed design
doc will be attached before proceeding with the work.
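
The shape below is purely illustrative, under the assumption that per-partition
output sizes are available to the scheduler once a stage finishes (Spark's map
output statuses do report block sizes). The object and method names are made up
for this sketch and are not a committed design.

{code}
object AdaptivePartitioningSketch {
  // Create one task per targetPartitionBytes of upstream output, but never zero tasks.
  def adaptiveNumTasks(upstreamPartitionBytes: Seq[Long], targetPartitionBytes: Long): Int = {
    val totalBytes = upstreamPartitionBytes.sum
    math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)
  }
}
{code}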


