[jira] [Commented] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

Sean Owen (JIRA) Thu, 28 Jan 2016 00:49:06 -0800

    [ https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121044#comment-15121044 ]


Sean Owen commented on SPARK-13059:
-----------------------------------

I like the idea. Although I'm not 100% sure, I think it wouldn't violate any 
guaranteed behavior. For example, take(1) should still take from the first 
partition; in other jobs, the first partition may simply be scheduled later.

> Sort inputsplits by size in HadoopRDD to avoid long tails
> ---------------------------------------------------------
>
>                 Key: SPARK-13059
>                 URL: https://issues.apache.org/jira/browse/SPARK-13059
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Rajesh Balamohan
>
> HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the 
> corresponding HadoopPartitions. The input splits generated are not always 
> of equal size, and some splits can be much smaller than others. If the 
> bigger splits are scheduled at the end of the job, the job can end up with 
> a long tail. Sorting the input splits by size (in descending order) can 
> help schedule the larger splits up front. This could also help with 
> speculation.
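
To make the long-tail argument concrete, here is a rough sketch of the effect, not Spark code. It assumes a task's run time is proportional to its split size and that each task goes to whichever slot frees up first; the split sizes, the slot count, and the helper names `makespan` and `schedule_order` are all hypothetical. Only the scheduling order changes; the partition indices stay as they were, which is the property the take(1) remark above relies on.

```python
import heapq


def makespan(split_sizes, num_slots):
    """Run splits in the given order on num_slots slots; return the finish time.

    Assumption: run time equals split size, and each split is handed to
    whichever slot becomes free first.
    """
    slots = [0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        earliest_free = heapq.heappop(slots)
        heapq.heappush(slots, earliest_free + size)
    return max(slots)


def schedule_order(split_sizes):
    """Partition indices ordered by split size, largest first.

    The indices themselves are untouched: partition 0 is still partition 0,
    it may just be launched later.
    """
    return sorted(range(len(split_sizes)),
                  key=lambda i: split_sizes[i],
                  reverse=True)


# Hypothetical split sizes, with the one big split at the end.
splits = [1, 1, 1, 1, 1, 1, 8]

order = schedule_order(splits)
unsorted_time = makespan(splits, num_slots=2)
sorted_time = makespan([splits[i] for i in order], num_slots=2)

print("scheduling order:", order)
print("finish time, original order:", unsorted_time)
print("finish time, largest first: ", sorted_time)
```

With these made-up numbers the big split starts last in the original order and dominates the tail; launching it first lets the small splits fill the other slot while it runs.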



