[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164095#comment-14164095 ]

Mridul Muralidharan commented on SPARK-3561:
--------------------------------------------


I agree with [~pwendell] that it does not help Spark to introduce a dependency 
on Tez in core.
[~ozhurakousky] is Tez available on all YARN clusters? Or is it an additional 
runtime dependency?

If it is available by default, we can make it a runtime switch to use Tez for 
jobs running in yarn-standalone and yarn-client mode.
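
Purely as an illustration, such a switch could be a configuration property. 
The key below is hypothetical - it is not an existing Spark option - and is 
only meant to show the shape of a runtime toggle:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical toggle for a Tez-backed execution path on YARN.
// "spark.yarn.execution.engine" is an invented key, for illustration only.
val conf = new SparkConf()
  .setAppName("etl-job")
  .setMaster("yarn-client")
  .set("spark.yarn.execution.engine", "tez")
{code}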
But before that ...


While better multi-tenancy would be a likely benefit - my specific interest in 
this patch has more to do with the much better shuffle performance that Tez 
offers :-) Specifically for ETL jobs, I can see other benefits which might be 
relevant - one of our collaborative filtering implementations, though not ETL, 
comes fairly close to it in job characteristics and suffers due to some of our 
shuffle issues ...


As I alluded to, I do not think we should have an open-ended extension point - 
where any class name can be provided and extends functionality in arbitrary 
ways - for example, like the SPI we have for compression codecs.
As Patrick mentioned, this gives the impression that the approach is blessed by 
Spark developers - even if tagged as Experimental.
Particularly with core internals, I would be very wary of exposing them via an 
SPI - simply because we need the freedom to evolve them for performance or 
functionality reasons.
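
To make that concern concrete, here is a minimal sketch of the class-name-based 
SPI pattern in question, loosely modeled on how Spark instantiates compression 
codecs by reflection. The names here are illustrative, not actual Spark 
internals:

{code:scala}
// Illustrative class-name-based SPI, loosely modeled on Spark's
// reflection-based compression codec loading. Names are hypothetical.
trait ExecutionBackend {
  def runJob(description: String): Unit
}

object ExecutionBackend {
  // Whatever class name is configured gets instantiated blindly; nothing
  // constrains what the implementation does with core internals, which is
  // why evolving those internals later becomes a compatibility hazard.
  def create(className: String): ExecutionBackend =
    Class.forName(className)
      .getConstructor()
      .newInstance()
      .asInstanceOf[ExecutionBackend]
}
{code}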


On the other hand, I am in favour of exploring this option to see what sort of 
benefits we get out of it, assuming it has been prototyped already - which I 
thought was the case here, though I am yet to see a PR for it (not sure if I 
missed it!).
Given that Tez is supposed to be reasonably mature - if there is a Spark + Tez 
version, I want to see what benefits (if any) are observed as a result of this 
effort.
I had discussed Spark + Tez integration about a year or so back with Matei - 
but at that time, Tez was probably not that mature - maybe this is a better 
time!

[~ozhurakousky] Do you have a Spark-on-Tez prototype done already? Or is this 
an experiment you have yet to complete? If complete, what sort of performance 
difference do you see? What metrics are you using?


If there are significant benefits, I would want to take a closer look at the 
final proposed patch ... I would be interested in it making it into Spark in 
some form.

As [~nchammas] mentioned - if it is possible to address this in Spark directly, 
nothing would be better - particularly since it would benefit all modes of 
execution and not just the YARN + Tez combination.
If the gap can't be narrowed, and the benefits are significant (for some, as of 
now undefined, definition of "benefits" and "significant") - then we can 
consider a Tez dependency in the yarn module.

Of course, all these questions are moot until we have a better quantitative 
judgement of what the expected gains are and what the experimental results show.

> Allow for pluggable execution contexts in Spark
> -----------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>
>         Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal: 
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to the Hadoop execution environment - as a 
> non-public API (@DeveloperAPI) not exposed to end users of Spark. 
> The trait will define only 4 operations: 
> * hadoopFile 
> * newAPIHadoopFile 
> * broadcast 
> * runJob 
> Each method directly maps to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL such as 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to such an 
> implementation. 
> An integrator will then have the option to provide a custom implementation, 
> either by implementing JobExecutionContext from scratch or by extending the 
> DefaultExecutionContext. 
> Please see the attached design doc for more details. 
> A Pull Request will be posted shortly as well.
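
For concreteness, here is a rough sketch of how the proposed trait might look, 
inferred only from the description above. The actual signatures are in the 
attached design doc, so everything below - parameter lists included - is an 
assumption:

{code:scala}
import org.apache.hadoop.mapred.InputFormat
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical shape of the proposed trait. The four method names come from
// the description above; the signatures are simplified guesses that mirror
// the corresponding SparkContext methods.
trait JobExecutionContext {
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: org.apache.hadoop.mapreduce.InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V]): RDD[(K, V)]

  def broadcast[T](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U](
      sc: SparkContext,
      rdd: RDD[T],
      func: Iterator[T] => U): Array[U]
}

// Per the description, SparkContext would pick the implementation from a
// master URL such as "execution-context:foo.bar.MyJobExecutionContext",
// with a default implementation holding SparkContext's existing code.
{code}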



