[jira] [Commented] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

Rajesh Balamohan (JIRA) Thu, 28 Jan 2016 02:53:17 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121214#comment-15121214
 ]


Rajesh Balamohan commented on SPARK-13059:
------------------------------------------

Thanks [~srowen]. The same problem can occur even today if an InputFormat 
changes how it calculates splits (for any reason). 

> Sort inputsplits by size in HadoopRDD to avoid long tails
> ---------------------------------------------------------
>
>                 Key: SPARK-13059
>                 URL: https://issues.apache.org/jira/browse/SPARK-13059
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Rajesh Balamohan
>
> HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns 
> the results as HadoopPartitions. The input splits are not always of equal 
> size, and some can be much smaller than others. If the bigger splits are 
> scheduled at the end of the job, the job can end up with a long tail. 
> Sorting the input splits by size (in descending order) helps schedule the 
> larger splits up front. This could also help with speculation.
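
The long-tail effect described in the issue can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Spark's scheduler and not the proposed patch: processing time is assumed to be proportional to split size, each split goes to whichever slot frees up first, and the split sizes, slot count, and function names (makespan, sort_splits_descending) are hypothetical.

```python
import heapq


def makespan(split_sizes, num_slots):
    """Return the finish time of the last slot under greedy scheduling.

    Assumption: a split of size s takes s time units, and splits are
    handed out in list order to the slot that becomes free first.
    """
    slots = [0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        earliest_free = heapq.heappop(slots)
        heapq.heappush(slots, earliest_free + size)
    return max(slots)


def sort_splits_descending(split_sizes):
    """Order splits largest first, as the issue proposes."""
    return sorted(split_sizes, reverse=True)


# Hypothetical split sizes (MB): eight small splits, one large split last.
sizes = [16, 16, 16, 16, 16, 16, 16, 16, 128]

unsorted_finish = makespan(sizes, num_slots=4)
sorted_finish = makespan(sort_splits_descending(sizes), num_slots=4)

print("original order:", unsorted_finish)
print("largest first: ", sorted_finish)
```

With these made-up numbers the large split starts only after the small ones have run when the original order is used, so it alone determines the tail. Scheduling it first lets the small splits fill the remaining slots while it runs. In Hadoop, the size to sort on would presumably come from InputSplit.getLength().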



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
