[ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4961:
-----------------------------------

    Assignee:     (was: Apache Spark)

> Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
> processing time
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-4961
>                 URL: https://issues.apache.org/jira/browse/SPARK-4961
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> HadoopRDD.getPartitions is evaluated lazily, while the DAGScheduler is
> processing the JobSubmitted event. If the input directory is large,
> getPartitions can take a long time. For example, in our cluster it takes
> anywhere from 0.029s to 766.699s. While one JobSubmitted event is being
> processed, the others have to wait. Thus, we want to move
> HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted
> processing time, so that other JobSubmitted events do not have to wait
> as long. A HadoopRDD object could compute its partitions when it is
> instantiated.
> We can analyse and compare the execution time before and after the
> optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
> TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_____]
> (1) The app has only one job
> (a)
> The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
> The execution time of the job after optimization is....[time1__][time3___][time2_][time4_____].
> In summary, if the app has only one job, the total execution time is the
> same before and after optimization.
> (2) The app has 4 jobs
> (a) Before optimization,
> job1 execution time is [time2_][time3___][time4_____],
> job2 execution time is [time2__________][time3___][time4_____],
> job3 execution time is................................[time2____][time3___][time4_____],
> job4 execution time is................................[time2_____________][time3___][time4_____].
> After optimization,
> job1 execution time is [time3___][time2_][time4_____],
> job2 execution time is [time3___][time2__][time4_____],
> job3 execution time is................................[time3___][time2_][time4_____],
> job4 execution time is................................[time3___][time2__][time4_____].
> In summary, if the app has multiple jobs, the average execution time after
> optimization is less than before.
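
The timing argument above can be checked with a small simulation. This is only a sketch under stated assumptions, not Spark code: the Job record and both function names are hypothetical, all jobs are assumed to be submitted at time 0 from separate threads, JobSubmitted events are assumed to be handled one at a time, stages of different jobs are assumed to run concurrently, and time1 (TaskScheduler.start) is left out because it is identical in both cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical per-job costs, named after the legend in the issue."""
    t2: int  # JobSubmitted handling, excluding getPartitions
    t3: int  # HadoopRDD.getPartitions
    t4: int  # stage execution


def lazy_completion_times(jobs):
    """getPartitions runs inside the serial JobSubmitted handler."""
    clock = 0  # time at which the event loop becomes free
    done = []
    for job in jobs:
        clock += job.t2 + job.t3  # the handler blocks the loop for t2 + t3
        done.append(clock + job.t4)  # stages run after the handler returns
    return done


def eager_completion_times(jobs):
    """getPartitions runs in each submitting thread, before the event is posted."""
    done = [0] * len(jobs)
    clock = 0
    # Assumption: an event reaches the loop once its own getPartitions finishes,
    # so events are served in order of t3 (ties keep submission order).
    for i in sorted(range(len(jobs)), key=lambda k: jobs[k].t3):
        job = jobs[i]
        start = max(clock, job.t3)  # wait for the loop and for own partitions
        clock = start + job.t2  # the handler now blocks the loop only for t2
        done[i] = clock + job.t4
    return done


if __name__ == "__main__":
    jobs = [Job(t2=1, t3=3, t4=5)] * 4
    for name, fn in (("lazy", lazy_completion_times),
                     ("eager", eager_completion_times)):
        times = fn(jobs)
        print(name, times, "mean:", sum(times) / len(times))
```

With four identical jobs (t2=1, t3=3, t4=5) the mean completion time under these assumptions drops from 15 to 10.5, while a single job finishes at 9 either way, which agrees with both summaries in the description.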



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

