[ https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-4961.
------------------------------
    Resolution: Won't Fix

> Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-4961
>                 URL: https://issues.apache.org/jira/browse/SPARK-4961
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> HadoopRDD.getPartitions is evaluated lazily, while DAGScheduler is processing the JobSubmitted event.
> If the input directory is large, getPartitions may take a long time.
> For example, in our cluster, it takes from 0.029s to 766.699s. While one
> JobSubmitted event is being processed, the others have to wait. Thus, we
> want to move HadoopRDD.getPartitions forward to reduce
> DAGScheduler.JobSubmitted processing time. Other JobSubmitted events then
> don't need to wait as long. A HadoopRDD object could compute its partitions when it is
> instantiated.
> We can analyse and compare the execution time before and after the optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or
> TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_____]
> (1) The app has only one job
> (a)
> The execution time of the job before optimization is
> [time1__][time2_][time3___][time4_____].
> The execution time of the job after optimization is
> [time1__][time3___][time2_][time4_____].
> In summary, if the app has only one job, the total execution time is the same
> before and after optimization.
> (2) The app has 4 jobs
> (a) Before optimization,
> job1 execution time is [time2_][time3___][time4_____],
> job2 execution time is [time2__________][time3___][time4_____],
> job3 execution time is................................[time2____][time3___][time4_____],
> job4 execution time is................................[time2_____________][time3___][time4_____].
> After optimization,
> job1 execution time is [time3___][time2_][time4_____],
> job2 execution time is [time3___][time2__][time4_____],
> job3 execution time is................................[time3___][time2_][time4_____],
> job4 execution time is................................[time3___][time2__][time4_____].
> In summary, if the app has multiple jobs, the average execution time after
> optimization is less than before.
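
The timelines in the quoted description can be checked with a small queueing model. The sketch below is not Spark code: it assumes a single-threaded event loop that handles one JobSubmitted event at a time, and it assumes that in the "after" case getPartitions runs on the submitting thread before the event is posted. The function name `simulate`, the parameter `eager_partitions`, and the durations are all hypothetical, chosen only to illustrate the argument.

```python
def simulate(submit_times, t2, t3, t4, eager_partitions):
    """Return each job's latency (finish time minus submit time).

    t2: event-loop work for JobSubmitted, excluding getPartitions
    t3: getPartitions time
    t4: stage execution time (assumed not to contend across jobs)
    eager_partitions: False models the "before" case (getPartitions runs
    inside the event loop); True models the "after" case (getPartitions
    runs on the caller's thread before the event is posted).
    """
    loop_free_at = 0.0  # time at which the single-threaded loop is idle again
    latencies = []
    for submitted in sorted(submit_times):
        if eager_partitions:
            enqueued = submitted + t3  # partitions computed off the loop
            loop_cost = t2
        else:
            enqueued = submitted
            loop_cost = t2 + t3        # partitions computed on the loop
        start = max(enqueued, loop_free_at)  # wait for earlier events
        loop_free_at = start + loop_cost
        latencies.append(loop_free_at + t4 - submitted)
    return latencies


def average(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical durations in seconds; two jobs at t=0, two at t=10.
    jobs, t2, t3, t4 = [0, 0, 10, 10], 1, 5, 8
    print("before:", simulate(jobs, t2, t3, t4, eager_partitions=False))
    print("after: ", simulate(jobs, t2, t3, t4, eager_partitions=True))
```

With these made-up numbers the average latency drops from 18 to 14.5 seconds for four jobs, while a single job takes 14 seconds either way, which matches the two summaries in the description. The model ignores TaskScheduler.start ([time1__]) and any contention between stages.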