[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114626#comment-16114626 ]

Dhruve Ashar commented on SPARK-20589:
--------------------------------------

Spark defines stages based on shuffle dependencies, and the application 
developer has no direct way to control where those boundaries fall. Limiting 
concurrency per stage would therefore require the developer to know exactly 
how the stage boundaries are drawn and to attach the max-concurrency setting 
to the right transformation. Supporting that would need an API change at the 
transformation level, plus logic to resolve the effective max concurrent tasks 
across all the transformations that end up in a given stage. 
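
To illustrate the stage-boundary problem (this snippet is only a sketch of mine, 
not part of the proposal): the reduceByKey below introduces a shuffle dependency, 
so a single job runs as two stages, and the maps on either side of it land in 
different stages that the RDD API gives no direct handle on.

{code:java}
// Hedged illustration only: reduceByKey forces a shuffle, so this one job
// runs as two stages -- the map before the shuffle and the map after it.
// A per-stage concurrency limit would have to target one of these implicitly
// created stages, which the current API does not let the developer name.
val counts = sc.parallelize(1 to 1000000, 1000)
  .map(x => (x % 100, 1))      // stage 0: runs up to the shuffle boundary
  .reduceByKey(_ + _)          // shuffle dependency => new stage
  .map { case (k, v) => v }    // stage 1: runs after the shuffle
  .count()
{code}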

Since the scope of that change is much broader and more involved, in the meantime 
I would like to propose an approach where we limit the number of concurrent tasks 
on a per-job basis rather than a per-stage basis. This way the developer has 
control over limiting only a single Spark job in their application. This approach 
can be implemented in two steps:
1 - Specify the concurrency limit for the job.
2 - Tag the job to be limited with a specific job group.

So the application code looks something like this:

{code:java}
// limit concurrency for all the job(s) under jobGroupId => myjob
conf.set("spark.job.myjob.maxConcurrentTasks", "10")

// this job is not tagged with the group, so it runs with the default concurrency
sc.parallelize(1 to Int.MaxValue, 10000).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()

// tag subsequent jobs to be limited under the respective jobGroupId
sc.setJobGroup("myjob", "", false)
sc.parallelize(1 to Int.MaxValue, 10000).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()

// clear the tag (or set it to a different value); later jobs are no longer limited
sc.clearJobGroup()
sc.parallelize(1 to Int.MaxValue, 10000).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()
{code}
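
As a side note (my own sketch, not required by the proposal): since the tag is 
per-thread state on the SparkContext, it may be worth scoping it so that a 
failure inside the limited job does not leave later jobs tagged. The 
limitedJobs helper below is hypothetical and only uses the existing 
setJobGroup/clearJobGroup APIs.

{code:java}
// Hypothetical convenience wrapper (not part of the proposal): runs a block of
// jobs under the given job group and always clears the tag afterwards, so an
// exception does not leave subsequent jobs accidentally limited.
def limitedJobs[T](sc: org.apache.spark.SparkContext, groupId: String)(body: => T): T = {
  sc.setJobGroup(groupId, s"jobs limited under $groupId", false)
  try {
    body
  } finally {
    sc.clearJobGroup()
  }
}

// usage: only the jobs triggered inside the block carry the "myjob" tag
limitedJobs(sc, "myjob") {
  sc.parallelize(1 to Int.MaxValue, 10000).map(_ + 1).count()
}
{code}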

> Allow limiting task concurrency per stage
> -----------------------------------------
>
>                 Key: SPARK-20589
>                 URL: https://issues.apache.org/jira/browse/SPARK-20589
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job might be accessing another 
> service and you don't want to DoS that service, for instance Spark writing 
> to HBase or Spark doing HTTP PUTs against a service.  Many times you want to 
> do this without limiting the number of partitions. 


