[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229371#comment-14229371 ]
Sandy Ryza commented on SPARK-4630:
-----------------------------------

Hey [~pwendell],

Spark deals much better with large numbers of partitions than MR, but, in our deployments, we've found the number of partitions to be by far the most important configuration knob in determining an app's performance. My recommendation to novice users is often to just keep doubling the number of partitions until problems go away (a sketch of that kind of hand-tuning is appended at the end of this message). I don't have good numbers, but I'd say this has been the solution to maybe half of the performance issues I've been asked to assist with.

Most issues we hit come from not having enough partitions, which causes high GC time and unnecessary spilling. As for why we don't just recommend a parallelism of 100,000 all the time: the overheads that eventually kick in are more state to hold on the driver, more random I/O in the shuffle as tasks fetch smaller map output blocks, and more task startup overhead.

So why not just make Spark's existing heuristic use a higher number of partitions? First, it's unclear how we would actually do this. The heuristic we have now is to rely on the MR input format for the number of partitions in input stages and then, for child stages, to use the number of partitions in the parent stage. A heuristic that, for example, doubled the number of partitions of the parent stage wouldn't make sense in the common case of a reduceByKey. Second, a heuristic that doesn't take the actual size of the data into account can't handle situations where a stage's output is drastically larger or smaller than its input. Smaller output is typical for reduceByKey-style operations. Larger output is typical when flattening data or when reading highly compressed files. A typical 1GB Parquet split will decompress to 2-5GB in memory, and running a sortByKey, join, or groupByKey then forces each reducer to sort/aggregate that much data.

Ok, that was a long comment. I obviously haven't been able to exhaust the whole heuristic space, so any simpler ideas that I'm not thinking of are of course appreciated.

> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
>                 Key: SPARK-4630
>                 URL: https://issues.apache.org/jira/browse/SPARK-4630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Kostas Sakellis
>            Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For good performance, there is a sweet spot for how large the partitions executed by a task should be. If partitions are too small, the user pays a disproportionate cost in scheduling overhead. If partitions are too large, task execution slows down due to GC pressure and spilling to disk.
> To increase the performance of jobs, users often hand-optimize the number (and thus size) of partitions that the next stage gets. Factors that come into play are:
> - incoming partition sizes from the previous stage
> - number of available executors
> - available memory per executor (taking into account spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned on or off with a configuration option.
> To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion.
> Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work.
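
For illustration only, here is a minimal sketch of the hand-tuning described above, where the reducer-side partition count is picked by the user rather than derived from the data. The input path, the key extraction, and the count of 2000 are made-up placeholders, not recommendations:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair RDD functions (needed on 1.2.x and earlier)

// Minimal sketch (not from the Spark codebase): the user picks, and often just
// keeps doubling, the reducer-side partition count by hand.
object ManualPartitionTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-partition-tuning"))

    // Input-stage partitions come from the Hadoop input format behind textFile.
    val lines = sc.textFile("hdfs:///data/events")

    // The second argument to reduceByKey fixes the number of reduce partitions;
    // 2000 is an arbitrary hand-tuned value a user might have settled on.
    val counts = lines
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _, 2000)

    counts.saveAsTextFile("hdfs:///data/event-counts")
    sc.stop()
  }
}
{code}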
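And a back-of-the-envelope sketch of the sizing the description proposes the DAGScheduler could do once per-partition map output sizes are known at stage completion. This is a standalone illustration under an assumed 128MB-per-partition target; none of these names exist in Spark, and the real change would live inside the DAGScheduler rather than in a helper like this:

{code}
// Hypothetical standalone helper, not Spark code: derive a reduce-side partition
// count from the map output sizes observed when the parent stage completes.
object PartitionSizing {
  def choosePartitionCount(mapOutputSizes: Seq[Long],
                           targetBytesPerPartition: Long = 128L * 1024 * 1024,
                           maxPartitions: Int = 100000): Int = {
    val totalBytes = mapOutputSizes.sum
    val byDataSize = math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt
    // Clamp so tiny stages still get at least one task and huge ones don't
    // overwhelm the driver with task state.
    math.min(math.max(byDataSize, 1), maxPartitions)
  }

  def main(args: Array[String]): Unit = {
    // e.g. 400 map partitions whose ~1GB Parquet splits decompressed to ~3GB each
    val sizes = Seq.fill(400)(3L * 1024 * 1024 * 1024)
    println(choosePartitionCount(sizes)) // 9600 reduce partitions at ~128MB each
  }
}
{code}

The compressed-input example is exactly the case the current static heuristic misses: the parent stage's partition count says nothing about how large the shuffled data actually is.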