[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
------------------------------------
    Description: 
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a 
distributed environment provided by those platforms. 

However, this integration is tightly coupled to Spark's implementation, making 
it rather difficult to introduce integration points with other 
resource-managing platforms without constant modifications to Spark's core 
(see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native to those 
platforms, thus limiting access to some of their native features (e.g., MR2/Tez 
stateless shuffle, YARN resource localization, YARN management and monitoring, 
and more) as well as specialization aspects of such execution environments 
(open source and proprietary). As an example, the inability to access such 
features is starting to affect Spark's viability in large-scale batch and/or 
ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
co-exist and collaborate with a variety of existing Big Data platforms as well 
as those yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) 
as a non-public API (@DeveloperApi).
The trait will define four operations (sketched below):
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
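
As an illustration only, here is a minimal Scala sketch of what such a trait 
could look like, assuming signatures that mirror the corresponding SparkContext 
methods with the context passed in explicitly; the actual signatures are the 
ones in the attached design doc and pull request.

{code:scala}
import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, TaskContext}

// Sketch only; the real trait is defined in the attached pull request.
@DeveloperApi
trait JobExecutionContext {

  // Mirrors SparkContext.hadoopFile (old mapred API).
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  // Mirrors SparkContext.newAPIHadoopFile (new mapreduce API).
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  // Mirrors SparkContext.broadcast.
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  // Mirrors SparkContext.runJob; the default implementation would keep
  // delegating to Spark's own DAGScheduler.
  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit
}
{code}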

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be selected by 
SparkContext via the master URL, as in 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext. This allows the 
current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions 
of Spark.
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext by either implementing it from scratch or extending from 
DefaultExecutionContext.
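
Purely as a sketch of the integrator-facing side, using the names from this 
description (DefaultExecutionContext and the _execution-context:_ master URL 
prefix); the class and package names below are made up for the example, and 
the exact wiring is defined by the pull request.

{code:scala}
package foo.bar

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical custom context. DefaultExecutionContext (the default
// implementation described above) carries the existing SparkContext behavior,
// so an integrator overrides only the operations that need platform-specific
// handling, e.g. runJob handing the DAG to a native execution engine. Its
// import is omitted here because its package is defined by the pull request.
class MyJobExecutionContext extends DefaultExecutionContext {
  // override runJob / hadoopFile / newAPIHadoopFile / broadcast as needed
}

object Example {
  def main(args: Array[String]): Unit = {
    // The custom context would be selected through the proposed
    // "execution-context:" master URL scheme.
    val conf = new SparkConf()
      .setAppName("example")
      .setMaster("execution-context:foo.bar.MyJobExecutionContext")
    val sc = new SparkContext(conf)
    // Existing SparkContext calls now delegate to MyJobExecutionContext.
    sc.stop()
  }
}
{code}
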
Please see the attached design doc and pull request for more details.



> Expose pluggable architecture to facilitate native integration with 
> third-party execution environments.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>
>         Attachments: SPARK-3561.pdf
>
>


