Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/14854 Another advantage of this PR's approach is straggler-resilience: imagine that you're going to have to compute all partitions anyways and also assume that some small number of partitions will dominate the total computation time. If you take, say, a 1000 partition job and break it into ten 100 partition jobs and assume that there are 10 straggler tasks total, then splitting the total work into a sequence of multiple jobs increases the likelihood that those stragglers will run in separate jobs, causing their total runtimes to be added together when determining the total runtime of the action. In contrast, submitting all of those tasks as part of the same job allows them to overlap with each other and run in parallel once all of the smaller tasks have finished, resulting in a better worst-case runtime.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org