[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259041#comment-14259041 ]

YanTang Zhai commented on SPARK-4962:
-------------------------------------

The main purpose of this PR is to reduce resource waste on a busy cluster.
For example, a process initializes a SparkContext instance, reads a few files 
from HDFS or many records from PostgreSQL, and then calls RDD's collect 
operation to submit a job.
When the SparkContext is initialized, an app is submitted to the cluster and 
some resources are held by this app.
These resources are not actually used until a job is submitted by an RDD action.
The resources held in the period from initialization to actual use can be 
considered wasted.
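Concretely, the pattern looks like the following minimal driver sketch (the 
input path and the driver-side work are illustrative, not taken from the PR):

    // Minimal driver sketch: cluster resources are requested as soon as the
    // SparkContext is constructed, but no task runs until collect().
    import org.apache.spark.{SparkConf, SparkContext}

    object SlowStartExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("slow-start-example"))
        // App submitted above; executors may already be granted and idle.

        // Driver-side work that uses no cluster resources (e.g. reading
        // records from PostgreSQL) happens here while executors sit idle.

        val lines = sc.textFile("hdfs:///some/input")  // illustrative path
        val result = lines.collect()  // first action: the job is submitted only now
        println(result.length)
        sc.stop()
      }
    }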
If the app is submitted when the SparkContext is initialized, all of the 
resources needed by the app may be granted before the job runs. The job can 
then run efficiently, without resource constraints.
On the contrary, if the app is submitted when the job is submitted, the 
resources needed by the app may be granted at different times, so the job may 
run less efficiently while some resources are still being requested.
Thus I use a configuration parameter, spark.scheduler.app.slowstart (default 
false), to let the user make this tradeoff between economy and efficiency.
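With the flag enabled, an economy-minded user trades some job startup latency 
for a shorter resource occupation period. Setting the proposed parameter in 
the driver would look like this (the flag exists only with this PR applied):

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.scheduler.app.slowstart", "true")  // proposed: defer app submission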
There are 9 kinds of master URL and 6 kinds of SchedulerBackend.
LocalBackend and SimrSchedulerBackend do not need a deferred start, since it 
makes no difference for them.
SparkClusterSchedulerBackend (yarn-standalone or yarn-cluster) does not defer 
the start, since the app has already been submitted by SparkSubmit.
CoarseMesosSchedulerBackend and MesosSchedulerBackend could defer the start.
YarnClientSchedulerBackend (yarn-client) could defer the start.
For now, this PR defers TaskScheduler.start only for yarn-client mode.
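The mechanics can be illustrated with a self-contained toy sketch (simplified 
types and method names, not the actual patch; the real change lives in 
SparkContext and only applies to yarn-client mode):

    // Toy model of lazy scheduler start: the scheduler (and hence the app
    // submission) starts eagerly by default, or lazily on the first job.
    class ToyTaskScheduler {
      def start(): Unit = println("app submitted; cluster resources requested")
    }

    class ToyContext(slowStart: Boolean) {
      private val scheduler = new ToyTaskScheduler
      private var started = false

      if (!slowStart) startOnce()  // current behaviour: start eagerly at init

      private def startOnce(): Unit = synchronized {
        if (!started) { scheduler.start(); started = true }
      }

      def runJob(): Unit = {
        startOnce()  // slow start: the app is submitted only on the first job
        println("running stages")
      }
    }

    object LazyStartDemo extends App {
      val ctx = new ToyContext(slowStart = true)
      // ... driver-side work; no cluster resources held yet ...
      ctx.runJob()  // scheduler starts here, then stages run
    }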

> Put TaskScheduler.start back in SparkContext to shorten cluster resources 
> occupation period
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4962
>                 URL: https://issues.apache.org/jira/browse/SPARK-4962
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> When a SparkContext object is instantiated, the TaskScheduler is started and 
> some resources are allocated from the cluster. However, these resources may 
> not be used for a while, for example while DAGScheduler.JobSubmitted is being 
> processed. The resources are wasted during this period. Thus, we want to 
> defer TaskScheduler.start to shorten the cluster resource occupation period, 
> especially on a busy cluster.
> The TaskScheduler could be started just before running stages.
> We can analyse and compare the resource occupation periods before and after 
> this optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and 
> TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_____]
> The cluster resource occupation period before the optimization is 
> [time2_][time3___][time4_____].
> The cluster resource occupation period after the optimization is 
> ........[time3___][time4_____] (no resources are held during [time2_]).
> In summary, the cluster resource occupation period after the optimization is 
> shorter than before.
> If HadoopRDD.getPartitions could also be moved forward (SPARK-4961), the 
> period may be shortened further, to just [time4_____].
> These resource savings matter on a busy cluster.


