[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4630:
-----------------------------------

    Assignee: Kostas Sakellis  (was: Apache Spark)

> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
>                 Key: SPARK-4630
>                 URL: https://issues.apache.org/jira/browse/SPARK-4630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Kostas Sakellis
>            Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark
> job. There is a direct relationship between the size of partitions and the
> number of tasks: larger partitions mean fewer tasks. For better performance,
> Spark has a sweet spot for how large the partitions executed by a single
> task should be. If partitions are too small, the user pays a
> disproportionate cost in scheduling overhead. If partitions are too large,
> task execution slows down due to GC pressure and spilling to disk.
> To increase job performance, users often hand-tune the number (and
> therefore the size) of partitions that the next stage receives. Factors
> that come into play are:
> - incoming partition sizes from the previous stage
> - the number of available executors
> - the available memory per executor (taking spark.shuffle.memoryFraction
>   into account)
> Spark has access to this data, so it should be able to do the partition
> sizing for the user automatically. This feature can be turned on/off with a
> configuration option.
> To make this happen, we propose modifying the DAGScheduler to take
> partition sizes into account upon stage completion. Before scheduling the
> next stage, the scheduler can examine the sizes of the partitions and
> determine the appropriate number of tasks to create. Since this change
> requires non-trivial modifications to the DAGScheduler, a detailed design
> doc will be attached before proceeding with the work.
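The sizing heuristic the description proposes could be sketched roughly as follows. This is a hypothetical illustration only: the function name, parameters, and exact formula are assumptions, not Spark APIs or the design the ticket will attach.

```python
import math

def suggest_num_partitions(incoming_partition_bytes,      # sizes from the previous stage
                           num_executors,
                           executor_memory_bytes,
                           shuffle_memory_fraction=0.2,   # default for spark.shuffle.memoryFraction
                           cores_per_executor=1):
    """Hypothetical sketch: pick a task count so each partition fits in the
    shuffle memory one task can use, without dropping below the cluster's
    available parallelism."""
    total_bytes = sum(incoming_partition_bytes)
    # Shuffle memory available to one concurrently running task.
    per_task_budget = (executor_memory_bytes * shuffle_memory_fraction) / cores_per_executor
    # Enough partitions that each one fits within the per-task budget
    # (avoids GC pressure and spilling from oversized partitions)...
    by_memory = math.ceil(total_bytes / per_task_budget)
    # ...but never fewer than one task per available core (avoids idle
    # executors), and at least one task overall.
    min_parallelism = num_executors * cores_per_executor
    return max(by_memory, min_parallelism, 1)

# 100 incoming partitions of 512 MiB on 10 executors with 4 GiB each and
# 2 cores per executor -> 125 tasks of ~400 MiB, not the incoming 100.
print(suggest_num_partitions([512 * 1024 ** 2] * 100, 10, 4 * 1024 ** 3,
                             cores_per_executor=2))
```

Under these assumptions, small inputs fall back to the cluster's parallelism floor rather than producing a handful of huge tasks, which mirrors the trade-off the description lays out.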
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4630:
-----------------------------------

    Assignee: Apache Spark  (was: Kostas Sakellis)

> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
>                 Key: SPARK-4630
>                 URL: https://issues.apache.org/jira/browse/SPARK-4630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Kostas Sakellis
>            Assignee: Apache Spark
>