[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164095#comment-14164095 ]
Mridul Muralidharan commented on SPARK-3561:
--------------------------------------------

I agree with [~pwendell] that it does not help Spark to introduce a Tez dependency in core. [~ozhurakousky], is Tez available on all YARN clusters, or is it an additional runtime dependency? If it is available by default, we can make it a runtime switch to use Tez for jobs running in yarn-standalone and yarn-client mode. But before that ...

While better multi-tenancy would be a likely benefit, my specific interest in this patch has more to do with the much better shuffle performance that Tez offers :-) specifically for ETL jobs. I can see other benefits which might be relevant - one of our collaborative filtering implementations, though not ETL, comes fairly close to it in job characteristics and suffers due to some of our shuffle issues ...

As I alluded to, I do not think we should have an open-ended extension point where any class name can be provided to extend functionality in an arbitrary manner - for example, like the SPI we have for compression codecs. As Patrick mentioned, this gives the impression that the approach is blessed by Spark developers, even if tagged as Experimental. Particularly with core internals, I would be very wary of exposing them via an SPI, simply because we need the freedom to evolve them for performance or functionality reasons.

On the other hand, I am in favour of exploring this option to see what sort of benefits we get out of it, assuming it has been prototyped already - which I thought was the case here, though I have yet to see a PR with that (not sure if I missed it!). Given that Tez is supposed to be reasonably mature, if there is a Spark + Tez version, I want to see what benefits (if any) are observed as a result of this effort. I had discussed Spark + Tez integration about a year or so back with Matei, but at that time Tez was probably not that mature - maybe this is a better time!
[~ozhurakousky] Do you have a Spark-on-Tez prototype done already, or is this an experiment you have yet to complete? If complete, what sort of performance difference do you see, and what metrics are you using? If there are significant benefits, I would want to take a closer look at the final proposed patch - I would be interested in it making its way into Spark in some form.

As [~nchammas] mentioned, if it is possible to address this in Spark directly, nothing like it - particularly since it will benefit all modes of execution and not just the YARN + Tez combination. If the gap can't be narrowed, and the benefits are significant (for some, as of now undefined, definition of "benefits" and "significant"), then we can consider a Tez dependency in the YARN module. Of course, all these questions are moot until we have a better quantitative judgement of what the expected gains are and what the experimental results are.

> Allow for pluggable execution contexts in Spark
> -----------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>
>         Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource managers such as
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the
> current architecture of Spark-on-YARN can be enhanced to provide
> significantly better utilization of cluster resources for large-scale, batch
> and/or ETL applications when run alongside other applications (Spark and
> others) and services in YARN.
>
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait)
> - a gateway and a delegate to the Hadoop execution environment - as a
> non-public API (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
>
> Each method maps directly to the corresponding method in the current version
> of SparkContext. The JobExecutionContext implementation will be selected by
> SparkContext via a master URL of the form
> "execution-context:foo.bar.MyJobExecutionContext", with the default
> implementation containing the existing code from SparkContext, thus allowing
> the current (corresponding) methods of SparkContext to delegate to such an
> implementation.
>
> An integrator will then have the option to provide a custom implementation,
> either by implementing the trait from scratch or by extending from
> DefaultExecutionContext.
>
> Please see the attached design doc for more details.
> A Pull Request will be posted shortly as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
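To make the proposal concrete, here is a minimal, self-contained sketch (no Spark dependency) of the pluggable execution-context idea described above. The trait name, the four operations, the "execution-context:" master-URL scheme, and the DefaultExecutionContext fallback follow the description; the simplified method signatures and all other details are hypothetical, standing in for SparkContext's real ones.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// The four delegated operations named in the proposal, with toy signatures
// standing in for the real SparkContext ones (which return RDDs/Broadcasts).
interface JobExecutionContext {
    List<String> hadoopFile(String path);
    List<String> newAPIHadoopFile(String path);
    <T> T broadcast(T value);
    <T, U> List<U> runJob(List<T> data, Function<T, U> func);
}

// Default implementation: keeps today's behavior, so existing SparkContext
// methods can simply delegate to it.
class DefaultExecutionContext implements JobExecutionContext {
    public List<String> hadoopFile(String path) { return List.of("read:" + path); }
    public List<String> newAPIHadoopFile(String path) { return List.of("read-new:" + path); }
    public <T> T broadcast(T value) { return value; }
    public <T, U> List<U> runJob(List<T> data, Function<T, U> func) {
        return data.stream().map(func).collect(Collectors.toList());
    }
}

class ExecutionContextResolver {
    // Dispatch on a master URL like "execution-context:foo.bar.MyContext";
    // any other master URL falls back to the default implementation.
    static JobExecutionContext resolve(String masterUrl) {
        String prefix = "execution-context:";
        if (!masterUrl.startsWith(prefix)) {
            return new DefaultExecutionContext();
        }
        try {
            String className = masterUrl.substring(prefix.length());
            return (JobExecutionContext) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot load execution context: " + masterUrl, e);
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        // A non-matching master URL resolves to the default context.
        JobExecutionContext ctx = ExecutionContextResolver.resolve("local[*]");
        System.out.println(ctx.runJob(List.of(1, 2, 3), x -> x * 2)); // [2, 4, 6]
    }
}
```

The point of the reflective lookup is that a Tez-backed context (or any other integrator-provided one) never appears on core's compile-time classpath; it is loaded only when its class name is named in the master URL, which is exactly the non-public extension point the description proposes.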