[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229371#comment-14229371 ]
Sandy Ryza commented on SPARK-4630:
-----------------------------------

Hey [~pwendell],

Spark deals much better with large numbers of partitions than MR, but, in our deployments, we've found the number of partitions to be by far the most important configuration knob in determining an app's performance. My recommendation to novice users is often to just keep doubling the number of partitions until problems go away (a sketch of that kind of hand-tuning is appended at the end of this message). I don't have good numbers, but I'd say this has been the solution to maybe half of the performance issues I've been asked to assist with.

Most issues we hit come from not having enough partitions, which causes high GC time and unnecessary spilling. As for why we don't just recommend a parallelism of 100,000 all the time: the overheads that eventually kick in are more state to hold on the driver, more random I/O in the shuffle as tasks fetch smaller map output blocks, and more task startup overhead.

So why not just make Spark's existing heuristic use a higher number of partitions? First, it's unclear how we would actually do this. The heuristic we have now is to rely on the MR input format for the number of partitions in input stages and then, for child stages, to use the number of partitions in the parent stage. A heuristic that, for example, doubled the number of partitions of the parent stage wouldn't make sense in the common case of a reduceByKey. Second, a heuristic that doesn't take the actual size of the data into account can't handle situations where a stage's output is drastically larger or smaller than its input. Smaller output is typical for reduceByKey-style operations. Larger output is typical when flattening data or when reading highly compressed files. A typical 1GB Parquet split will decompress to 2-5GB in memory, and running a sortByKey, join, or groupByKey then forces each reducer to sort/aggregate that much data.

Ok, that was a long comment. I obviously haven't been able to exhaust the whole heuristic space, so any simpler ideas that I'm not thinking of are of course appreciated.

> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
>                 Key: SPARK-4630
>                 URL: https://issues.apache.org/jira/browse/SPARK-4630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Kostas Sakellis
>            Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For good performance, there is a sweet spot for how large the partitions executed by a task should be. If partitions are too small, the user pays a disproportionate cost in scheduling overhead. If partitions are too large, task execution slows down due to GC pressure and spilling to disk.
> To increase the performance of jobs, users often hand-optimize the number (and thus size) of partitions that the next stage gets. Factors that come into play are:
> - incoming partition sizes from the previous stage
> - number of available executors
> - available memory per executor (taking into account spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned on or off with a configuration option.
> To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion.
> Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work.
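
For illustration only, here is a minimal sketch of the hand-tuning described above, where the reducer-side partition count is picked by the user rather than derived from the data. The input path, the key extraction, and the count of 2000 are made-up placeholders, not recommendations:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair RDD functions (needed on 1.2.x and earlier)

// Minimal sketch (not from the Spark codebase): the user picks, and often just
// keeps doubling, the reducer-side partition count by hand.
object ManualPartitionTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-partition-tuning"))

    // Input-stage partitions come from the Hadoop input format behind textFile.
    val lines = sc.textFile("hdfs:///data/events")

    // The second argument to reduceByKey fixes the number of reduce partitions;
    // 2000 is an arbitrary hand-tuned value a user might have settled on.
    val counts = lines
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _, 2000)

    counts.saveAsTextFile("hdfs:///data/event-counts")
    sc.stop()
  }
}
{code}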
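And a back-of-the-envelope sketch of the sizing the description proposes the DAGScheduler could do once per-partition map output sizes are known at stage completion. This is a standalone illustration under an assumed 128MB-per-partition target; none of these names exist in Spark, and the real change would live inside the DAGScheduler rather than in a helper like this:

{code}
// Hypothetical standalone helper, not Spark code: derive a reduce-side partition
// count from the map output sizes observed when the parent stage completes.
object PartitionSizing {
  def choosePartitionCount(mapOutputSizes: Seq[Long],
                           targetBytesPerPartition: Long = 128L * 1024 * 1024,
                           maxPartitions: Int = 100000): Int = {
    val totalBytes = mapOutputSizes.sum
    val byDataSize = math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt
    // Clamp so tiny stages still get at least one task and huge ones don't
    // overwhelm the driver with task state.
    math.min(math.max(byDataSize, 1), maxPartitions)
  }

  def main(args: Array[String]): Unit = {
    // e.g. 400 map partitions whose ~1GB Parquet splits decompressed to ~3GB each
    val sizes = Seq.fill(400)(3L * 1024 * 1024 * 1024)
    println(choosePartitionCount(sizes)) // 9600 reduce partitions at ~128MB each
  }
}
{code}

The compressed-input example is exactly the case the current static heuristic misses: the parent stage's partition count says nothing about how large the shuffled data actually is.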