[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4961:
---

Assignee: (was: Apache Spark)

 Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
 processing time
 ---

 Key: SPARK-4961
 URL: https://issues.apache.org/jira/browse/SPARK-4961
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: YanTang Zhai
Priority: Minor

 HadoopRDD.getPartitions is evaluated lazily, inside DAGScheduler.JobSubmitted. 
 If the input directory is large, getPartitions may take a long time; in our 
 cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted 
 event is being processed, all other JobSubmitted events must wait. We 
 therefore want to move HadoopRDD.getPartitions forward, out of 
 DAGScheduler.JobSubmitted, to reduce its processing time, so that other 
 JobSubmitted events do not have to wait as long. A HadoopRDD could compute 
 its partitions when it is instantiated.
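 The proposed change, computing partitions eagerly at instantiation, can be 
 sketched as follows. This is a hedged illustration, not the actual Spark 
 patch: EagerPartitionsRDD and computePartitions are hypothetical stand-ins 
 for HadoopRDD and its InputFormat-based getPartitions.

```scala
// Hypothetical sketch, not the real Spark code: the expensive split
// listing runs at construction time instead of at job submission.
class EagerPartitionsRDD(inputDir: String) {
  private def computePartitions(): Array[Int] = {
    // In the real HadoopRDD this calls InputFormat.getSplits(...),
    // which lists inputDir and can be slow for large inputs.
    Array(0, 1, 2)
  }

  // A plain val (rather than a lazy computation inside JobSubmitted)
  // forces the work onto the caller's thread at instantiation, keeping
  // it out of the single-threaded DAGScheduler event loop.
  val partitions: Array[Int] = computePartitions()
}

val rdd = new EagerPartitionsRDD("/data/large-input") // cost is paid here
println(rdd.partitions.length)                        // later lookups are free
```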
 We can analyse and compare the execution timelines before and after the 
 optimization:
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 (1) The app has only one job
 The execution time of the job before optimization is 
 [time1__][time2_][time3___][time4_].
 The execution time of the job after optimization is 
 [time1__][time3___][time2_][time4_].
 In summary, if the app has only one job, the total execution time is the same 
 before and after the optimization.
 (2) The app has 4 jobs
 Before optimization,
 job1 execution time is [time2_][time3___][time4_],
 job2 execution time is [time2_][time3___][time4_],
 job3 execution time is [time2_][time3___][time4_],
 job4 execution time is [time2_][time3___][time4_].
 After optimization,
 job1 execution time is [time3___][time2_][time4_],
 job2 execution time is [time3___][time2_][time4_],
 job3 execution time is [time3___][time2_][time4_],
 job4 execution time is [time3___][time2_][time4_].
 In summary, if the app has multiple jobs, the average execution time after 
 the optimization is lower than before: each job pays its own [time3___] up 
 front at RDD instantiation, instead of inside the single-threaded 
 JobSubmitted event loop, so later jobs no longer queue behind each earlier 
 job's getPartitions.
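 The queueing argument above can be checked with a toy model. The numbers are 
 illustrative placeholders, not measurements from this issue; the model 
 assumes a single-threaded event loop that serializes JobSubmitted handling.

```scala
// Toy model of the single-threaded JobSubmitted event loop.
// time2/time3/time4 are illustrative placeholders, not measurements.
val time2 = 1.0   // JobSubmitted handling, excluding getPartitions
val time3 = 5.0   // HadoopRDD.getPartitions (listing the input dir)
val time4 = 2.0   // stage execution
val jobs  = 4

// Before: getPartitions runs inside the event loop, so job i waits
// behind (time2 + time3) for every job up to and including itself.
val before = (1 to jobs).map(i => i * (time2 + time3) + time4)

// After: partitions are computed at RDD instantiation, off the event
// loop, so only time2 serializes; each job pays its own time3 up front.
val after = (1 to jobs).map(i => time3 + i * time2 + time4)

println(f"average before: ${before.sum / jobs}%.1f")  // 17.0
println(f"average after:  ${after.sum / jobs}%.1f")   // 9.5
```

 Note that the first job finishes at the same time in both models, matching 
 case (1) above: the optimization helps only when jobs contend for the queue.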



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4961:
---

Assignee: Apache Spark
