YanTang Zhai created SPARK-3545:
-----------------------------------

             Summary: Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occupation period
                 Key: SPARK-3545
                 URL: https://issues.apache.org/jira/browse/SPARK-3545
             Project: Spark
          Issue Type: Improvement
            Reporter: YanTang Zhai
            Priority: Minor
We have two problems:

(1) HadoopRDD.getPartitions is computed lazily, inside DAGScheduler.JobSubmitted processing. If the input directory is large, getPartitions may take a long time; in our cluster, it ranges from 0.029s to 766.699s. While one JobSubmitted event is being processed, all others must wait. We therefore want to move HadoopRDD.getPartitions forward, which reduces DAGScheduler.JobSubmitted processing time so that other JobSubmitted events do not have to wait as long. A HadoopRDD could compute its partitions when it is instantiated.

(2) When a SparkContext is instantiated, the TaskScheduler is started and some resources are allocated from the cluster. However, those resources may not be used immediately, for example while DAGScheduler.JobSubmitted is still being processed, so they are wasted during that period. We therefore want to move TaskScheduler.start back to shorten the cluster resource occupation period, especially on a busy cluster. The TaskScheduler could be started just before the stages run.

We can analyse and compare the execution time before and after this optimization. Notation:

TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stage execution time: [time4_____]

(1) The app has only one job
(a) Job execution time before optimization: [time1__][time2_][time3___][time4_____]
    Job execution time after optimization:  [time3___][time2_][time1__][time4_____]
(b) Cluster resource occupation period before optimization: [time2_][time3___][time4_____]
    Cluster resource occupation period after optimization:  [time4_____]
In summary, if the app has only one job, the total execution time is the same before and after the optimization, while the cluster resource occupation period after the optimization is shorter than before.
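The single-job comparison above can be sketched with a small timeline model. This is an illustrative Python sketch with hypothetical durations; time1..time4 stand for the segment lengths defined above:

```python
# Hypothetical segment durations in seconds, matching the notation above:
# time1 = TaskScheduler.start, time2 = JobSubmitted (excluding the other two),
# time3 = HadoopRDD.getPartitions, time4 = stage execution.
time1, time2, time3, time4 = 1.0, 2.0, 30.0, 50.0

# Before: start TaskScheduler -> JobSubmitted (including getPartitions) -> stages.
# Cluster resources are held from the moment TaskScheduler.start returns.
total_before = time1 + time2 + time3 + time4
occupation_before = time2 + time3 + time4

# After: getPartitions at RDD instantiation -> JobSubmitted -> start TaskScheduler
# -> stages. Resources are held only while the stages actually run.
total_after = time3 + time2 + time1 + time4
occupation_after = time4

assert total_before == total_after           # same total execution time
assert occupation_after < occupation_before  # shorter resource occupation
```

The segments are only reordered, so the single-job total is unchanged; the win is that resource allocation is deferred until the stages need it.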
(2) The app has 4 jobs
(a) Before optimization:
    job1 execution time: [time2_][time3___][time4_____]
    job2 execution time: [time2____________][time3___][time4_____]
    job3 execution time: [time2____][time3___][time4_____]
    job4 execution time: [time2_______________][time3___][time4_____]
After optimization:
    job1 execution time: [time3___][time2_][time1__][time4_____]
    job2 execution time: [time3___][time2___________][time4_____]
    job3 execution time: [time3___][time2_][time4_____]
    job4 execution time: [time3___][time2__][time4_____]
In summary, if the app has multiple jobs, the average execution time after the optimization is shorter than before, and so is the cluster resource occupation period.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
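The multi-job benefit comes from the DAGScheduler processing JobSubmitted events one at a time: before the change, each event carries its own getPartitions cost and delays every later job; after the change, partitions are already computed at RDD instantiation, before the events enter the loop. A deliberately simplified model (illustrative Python with made-up per-job durations, assuming all jobs are submitted at t=0 and the up-front getPartitions work happens outside the measured window):

```python
# Made-up per-job durations: (submit, getp, stage) where submit is the
# JobSubmitted bookkeeping (time2), getp is getPartitions (time3) and
# stage is the stage execution time (time4).
jobs = [
    (1.0, 20.0, 10.0),
    (1.0, 35.0, 10.0),
    (1.0, 5.0, 10.0),
    (1.0, 40.0, 10.0),
]

def avg_completion(getp_in_event_loop):
    """All jobs submitted at t=0; the event loop handles them serially."""
    t, finishes = 0.0, []
    for submit, getp, stage in jobs:
        # Before the change, getPartitions runs inside the event loop and
        # therefore delays every later job; after, it is already done.
        t += submit + (getp if getp_in_event_loop else 0.0)
        finishes.append(t + stage)  # stages run once the event is processed
    return sum(finishes) / len(finishes)

before = avg_completion(getp_in_event_loop=True)
after = avg_completion(getp_in_event_loop=False)
assert after < before  # average job completion time drops
```

Under this model the queueing effect dominates: every second of getPartitions removed from the shared event loop is saved once per job waiting behind it, which is why the average improves even though each job still pays its own getPartitions cost somewhere.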