[ https://issues.apache.org/jira/browse/SPARK-19100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811786#comment-15811786 ]

Steve Loughran commented on SPARK-19100:
----------------------------------------

It's hard to imagine a dataset where larger input sizes imply shorter 
processing times. With columnar formats and predicate pushdown there is 
(ideally) enough column skipping going on that less data is actually read 
than the split size suggests, but even there it would take very unbalanced 
source data for the nominally larger splits to finish faster than the 
smaller ones.
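
To make that concrete, here's a minimal sketch (hypothetical path and column 
names, assuming a Spark 2.x session, e.g. in spark-shell) of why a split's 
nominal size can overestimate the bytes actually read from a columnar source:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Hypothetical wide Parquet table: the split size covers every column,
// but column pruning + predicate pushdown mean only one column's pages
// (and only the matching row groups) are actually read.
spark.read.parquet("/data/events")
  .filter($"date" === "2017-01-05")   // pushed down to Parquet row groups
  .select("user_id")                  // column pruning: one column read
  .count()
{code}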

Even if you don't take on that complexity today, it'd probably be wise to 
code this so that, in future, the ordering decision could be based on some 
other metric coming from the data source (a sketch of such a hook is below). 
There's been some interesting past work on sample-based split calculation; 
some of that could also relate to providing hints about scheduling order.
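
A minimal sketch of what such a hook might look like (hypothetical names, 
not an existing Spark API): the scheduler only sees an abstract priority, 
and the data source decides what feeds it.

{code:scala}
// Hypothetical trait: larger priority => scheduled earlier; values are
// only comparable within a single TaskSet.
trait SchedulingHint {
  def priority: Long
}

// Today's metric: raw input bytes.
case class InputSizeHint(bytes: Long) extends SchedulingHint {
  def priority: Long = bytes
}

// A future metric: a duration estimate from sample-based split calculation.
case class SampledDurationHint(estimatedMillis: Long) extends SchedulingHint {
  def priority: Long = estimatedMillis
}

// The ordering decision never mentions the concrete metric.
def scheduleOrder[T](tasks: Seq[(T, SchedulingHint)]): Seq[T] =
  tasks.sortBy { case (_, hint) => -hint.priority }.map(_._1)
{code}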

> Schedule tasks in descending order of estimated input size / estimated task 
> duration
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-19100
>                 URL: https://issues.apache.org/jira/browse/SPARK-19100
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Josh Rosen
>
> Say that you're scheduling a reduce phase and, based on the map output 
> sizes, you have identified that some reducers will be skewed because they 
> process much more input. In this case, it's preferable to schedule those 
> larger tasks first: the large reduce tasks will dominate the completion 
> time of the stage, so it's important to launch them as soon as possible.
> Spark's current task scheduling technique is naive and simply launches tasks 
> in ascending order of their indices: 
> https://github.com/apache/spark/blob/903bb8e8a2b84b9ea82acbb8ae9d58754862be3a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L158
> If we implemented an interface for a Task to expose a relative size 
> estimate (comparable only within the same TaskSet) and used that instead of 
> the index in the initial scheduling, then I think we could realize decent 
> stage completion time improvements for highly-skewed stages (a sketch 
> follows after this quoted description).
> This is also beneficial for initial map stages where you have input size 
> estimates: if you're generating tasks / splits from HDFS blocks or S3 
> FileStatuses, then you have some estimate of the total input size of the 
> task, and that may be a decent proxy for the task's duration (also 
> sketched below).
> Basically, my feeling is that scheduling in descending order of size 
> estimates, even if they're slightly inaccurate, must be a strict 
> improvement over scheduling in ascending order of task id.
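
For reference, a minimal sketch of the proposal above, with hypothetical 
names rather than Spark's actual TaskSetManager internals: tasks expose a 
relative size estimate, and the initial launch order sorts on it descending 
instead of on the task index.

{code:scala}
// Hypothetical interface: estimates are comparable only within one TaskSet.
trait HasSizeEstimate {
  def sizeEstimate: Long
}

case class PendingTask(index: Int, sizeEstimate: Long) extends HasSizeEstimate

// Sort descending by estimate, falling back to index for ties, instead of
// the current ascending-by-index launch order.
def initialLaunchOrder(tasks: Seq[PendingTask]): Seq[Int] =
  tasks.sortBy(t => (-t.sizeEstimate, t.index)).map(_.index)

// A skewed TaskSet: the large reducer is launched first.
initialLaunchOrder(Seq(
  PendingTask(0, 10L << 20),    // ~10 MB
  PendingTask(1, 900L << 20),   // ~900 MB, the skewed partition
  PendingTask(2, 12L << 20)     // ~12 MB
))  // => Seq(1, 2, 0)
{code}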
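And for the map-stage case, a sketch (hypothetical helper, standard Hadoop 
FileSystem API) of where the estimate would come from: an HDFS or S3 
FileStatus already carries the file length before any task has run.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List a directory and pair each file with its length; getLen is the
// "free" size estimate available at split-generation time.
def splitSizeEstimates(dir: Path, conf: Configuration): Seq[(Path, Long)] = {
  val fs = FileSystem.get(dir.toUri, conf)
  fs.listStatus(dir).toSeq
    .filter(_.isFile)
    .map(st => (st.getPath, st.getLen))
    .sortBy { case (_, len) => -len }   // largest inputs first
}
{code}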


