[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2015-04-27 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514062#comment-14514062
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Here is an interesting read that provides an even stronger case for separating 
flow construction from the execution context:
http://blog.acolyer.org/2015/04/27/musketeer-part-i-whats-the-best-data-processing-system/
and
http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf

The key points are:

_It thus makes little sense to force the user to target a single system at 
workflow implementation time. Instead, we argue that users should, in 
principle, be able to execute their high-level workflow on any data processing 
system (§3). Being able to do this has three main benefits:_
_1. Users write their workflow once, in a way they choose, but can easily 
execute it on alternative systems;_
_2. Multiple sub-components of a workflow can be executed on different back-end 
systems; and_
_3. Existing workflows can easily be ported to new systems._

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
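
For readers without the attached design doc, here is a minimal sketch of what such a 
trait could look like, inferred only from the six operation names listed above; the 
exact signatures are assumptions that mirror the corresponding SparkContext/RDD 
methods of the 1.1/1.2 era and are not taken from the actual patch.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

import scala.reflect.ClassTag

// Hypothetical sketch of the proposed trait; signatures are assumed to mirror
// the corresponding SparkContext/RDD methods and are not the actual patch.
trait JobExecutionContext {

  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit

  def persist[T](sc: SparkContext, rdd: RDD[T], level: StorageLevel): RDD[T]

  def unpersist[T](sc: SparkContext, rdd: RDD[T], blocking: Boolean): RDD[T]
}
{code}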






[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2015-01-05 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265113#comment-14265113
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Sorry for the delay in response, I'll just blame the holidays ;)
No, I have not had a chance to run the elasticity tests against 1.2, so I am 
gonna have to follow up on that.

The main motivation for this proposal is to _formalize an extension model 
around Spark’s execution environment_ so that other execution environments 
(new and existing) can be easily plugged in by a system integrator without 
requiring a new release of Spark (unlike the current integration mechanism, 
which relies on a ‘case’ statement with hard-coded values).
The reasons _why this is necessary_ are many, but they can all be summarized by 
the old **_generalization_** vs. **_specialization_** argument. While _Tez, 
elastic scaling, and utilization of cluster resources_ are all good examples and 
were indeed the initial motivators, they are certainly not the end goal, and the 
current efforts of several clients of ours, who are integrating Spark with their 
custom execution environments using the proposed approach, are good evidence of 
its viability and an obvious benefit to Spark’s technology, allowing it to 
become a developer-friendly “face” of many execution environments/technologies 
while continuing its own innovation.

So I think the next logical step would be to gather “for” and “against” 
arguments around a pluggable execution context for Spark in general; then we 
can discuss implementation. 
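
To make the "without requiring a new release of Spark" point concrete, here is a 
minimal sketch (entirely an assumption, not the actual patch) of how the proposed 
execution-context: master URL prefix could be resolved via reflection rather than 
a hard-coded 'case' branch; JobExecutionContext and DefaultExecutionContext are the 
types proposed in this JIRA.

{code}
// Illustrative only: resolve the proposed "execution-context:" master URL
// prefix via reflection, so any implementation on the classpath can be
// plugged in without modifying Spark core.
object ExecutionContextResolver {
  private val Prefix = "execution-context:"

  def resolve(master: String): JobExecutionContext = {
    if (master.startsWith(Prefix)) {
      val className = master.stripPrefix(Prefix)
      Class.forName(className).newInstance().asInstanceOf[JobExecutionContext]
    } else {
      // Fall back to the behavior currently hard-coded inside SparkContext.
      new DefaultExecutionContext
    }
  }
}
{code}

Under such a scheme, an integrator's implementation would be selected purely by 
configuration, which is the extension model being argued for here.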

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 






[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-30 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190337#comment-14190337
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Just as an FYI

The POC code for the Tez-based reference implementation of the aforementioned 
_execution context_ is available at 
https://github.com/hortonworks/spark-native-yarn, together with a samples 
project: https://github.com/hortonworks/spark-native-yarn-samples.
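
For orientation, selecting such an execution context would, under the mechanism 
proposed in this JIRA, look roughly like the sketch below. The class name after the 
execution-context: prefix is a placeholder and not necessarily the one used by the 
spark-native-yarn POC.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCountOnPluggableContext {
  def main(args: Array[String]): Unit = {
    // Placeholder implementation class; the actual POC class name may differ.
    val conf = new SparkConf()
      .setAppName("wordcount-on-pluggable-context")
      .setMaster("execution-context:foo.bar.TezJobExecutionContext")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}
{code}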

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 






[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-28 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186987#comment-14186987
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

[~vanzin]

I would not call it hard (we did it in the initial POC by simply mixing a 
custom trait into SC, essentially extending it). However, I do agree that a lot 
of Spark's initialization would still happen due to the implementation of SC 
itself, thus creating and initializing some of the artifacts that may not be 
used with a different execution context. 
Question: why was it done like this and not pushed into some SC.init operation? 
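
A rough sketch of what "mixing a custom trait into SC" can look like follows; it is 
purely illustrative, not the actual POC code, and it assumes the overridden runJob 
variant (with the allowLocal flag of the Spark 1.1/1.2 API) is open for overriding 
in the targeted Spark version.

{code}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Illustrative only: override an execution-related SparkContext method by
// mixing a trait into it, which is roughly what the comment above refers to.
trait CustomExecution extends SparkContext {
  override def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit = {
    // An alternative execution environment would take over here;
    // this sketch simply falls back to Spark's own scheduler.
    super.runJob(rdd, func, partitions, allowLocal, resultHandler)
  }
}

object CustomExecutionExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("poc").setMaster("local[2]")
    val sc = new SparkContext(conf) with CustomExecution
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}
{code}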

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 






[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-19 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176501#comment-14176501
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Resubmitting the (squashed) pull request, which adds 2 methods to the 
_JobExecutionContext_ to support caching: 
https://github.com/apache/spark/pull/2849
The prototype will be published shortly as well.
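
For clarity, the two caching-related additions would presumably look along these 
lines; the signatures are assumed to mirror RDD.persist/unpersist and are not taken 
from the actual pull request.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Assumed shape of the two operations added to JobExecutionContext to
// support caching; illustrative only, not the actual PR contents.
trait CachingOperations {
  def persist[T](sc: SparkContext, rdd: RDD[T], level: StorageLevel): RDD[T]
  def unpersist[T](sc: SparkContext, rdd: RDD[T], blocking: Boolean): RDD[T]
}
{code}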

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well






[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-19 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@Experimental) not exposed to end users of Spark. 
The trait will define 6 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext. 

Please see the attached design doc for more details. 
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark. 
The trait will define only 4 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext. 

Please see the attached design doc for more details. 
Pull Request will be posted shortly as well


 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well




[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-19 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@Experimental) not exposed to end users of Spark. 
The trait will define 6 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext. 

Please see the attached design doc for more details. 

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@Experimental) not exposed to end users of Spark. 
The trait will define 6 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext. 

Please see the attached design doc for more details. 
Pull Request will be posted shortly as well


 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 





[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-10 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166705#comment-14166705
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, I think there is a misunderstanding about the mechanics of this 
proposal, so I'd like to clarify. The proposal here is certainly not to 
introduce any new dependencies to Spark Core, and the 
existing pull request (https://github.com/apache/spark/pull/2422) clearly shows 
that. 

What I am proposing is to expose an integration point in Spark by means of 
extracting *existing* Spark operations into a *configurable and @Experimental* 
strategy. Not only would this allow Spark to integrate with other execution 
environments; it would also be very useful in unit testing, as it would provide 
a clear separation between the _assembly_ and _execution_ layers, allowing them 
to be tested in isolation. 

I think this feature would benefit Spark tremendously, particularly given that 
several folks have already expressed their interest in this feature/direction.

I appreciate your help and advice in getting this contribution into Spark. 
Thanks!
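
To illustrate the unit-testing point, here is a hypothetical test double built on the 
proposed DefaultExecutionContext (so everything shown is an assumption, using the 
runJob signature sketched earlier in this thread): it records job submissions instead 
of executing them, letting the _assembly_ side be asserted on without a running cluster.

{code}
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical test double: records what the "assembly" layer submits
// instead of executing it. Assumes the proposed DefaultExecutionContext
// and the JobExecutionContext signatures sketched elsewhere in this thread.
class RecordingExecutionContext extends DefaultExecutionContext {
  val submittedJobs = mutable.Buffer.empty[String]

  override def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    submittedJobs += s"${rdd.getClass.getSimpleName} over ${partitions.size} partitions"
    // Intentionally does not execute anything.
  }
}
{code}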

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well






[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-10 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166705#comment-14166705
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 10/10/14 12:10 PM:


Patrick, I think there is a misunderstanding about the mechanics of this 
proposal, so I'd like to clarify. The proposal here is certainly not to 
introduce any new dependencies to Spark Core, and the 
existing pull request (https://github.com/apache/spark/pull/2422) clearly shows 
that. 

What I am proposing is to expose an integration point in Spark by means of 
extracting *existing* Spark operations into a *configurable and @Experimental* 
strategy. Not only would this allow Spark to integrate with other execution 
contexts; it would also be very useful in unit testing, as it would provide 
a clear separation between the _assembly_ and _execution_ layers, allowing them 
to be tested in isolation. 

I think this feature would benefit Spark tremendously, particularly given that 
several folks have already expressed their interest in this feature/direction.

I appreciate your help and advice in getting this contribution into Spark. 
Thanks!


was (Author: ozhurakousky):
Patrick, I think there is a misunderstanding about the mechanics of this 
proposal, so I'd like to clarify. The proposal here is certainly not to 
introduce any new dependencies to Spark Core, and the 
existing pull request (https://github.com/apache/spark/pull/2422) clearly shows 
that. 

What I am proposing is to expose an integration point in Spark by means of 
extracting *existing* Spark operations into a *configurable and @Experimental* 
strategy. Not only would this allow Spark to integrate with other execution 
environments; it would also be very useful in unit testing, as it would provide 
a clear separation between the _assembly_ and _execution_ layers, allowing them 
to be tested in isolation. 

I think this feature would benefit Spark tremendously, particularly given that 
several folks have already expressed their interest in this feature/direction.

I appreciate your help and advice in getting this contribution into Spark. 
Thanks!

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well






[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-07 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162013#comment-14162013
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, your point about confusion with other JIRAs makes sense. Thanks.

With regard to the detailed design, can you please let me know what you are looking 
for and what would be useful? The only change I'm proposing is the addition of 
an @Experimental interface with the 4 methods, for the reasons stated in the 
design doc. 

For example: 
* Would it be useful if I sent another PR with the implementation of the 
interface? 
* Would it be useful if I shared benchmarks that showcase some of the benefits 
of alternative execution for batch/ETL scenarios?

Since this is my first involvement in the Spark community, I appreciate your 
guidance and I'm happy to provide any details you might find useful. Thanks!

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well






[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159559#comment-14159559
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Sandy, one other thing:
While I understand the reasoning for the changes to the title and the description 
of the JIRA, it would probably be better to coordinate this with the original 
submitter before making such changes in the future (similar to the way Patrick 
suggested in SPARK-3174). This would alleviate potential discrepancies in the 
overall message and intentions of the JIRA. 
Anyway, I’ve edited both the title and the description, taking your edits into 
consideration.

 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark _integrates with external resource-managing platforms_ such 
 as Apache Hadoop YARN and Mesos to facilitate 
 execution of Spark DAG in a distributed environment provided by those 
 platforms. 
 However, this integration is tightly coupled within Spark's implementation 
 making it rather difficult to introduce integration points with 
 other resource-managing platforms without constant modifications to Spark's 
 core (see comments below for more details). 
 In addition, Spark _does not provide any integration points to third-party 
 **DAG-like** and **DAG-capable** execution environments_ native 
 to those platforms, thus limiting access to some of their native features 
 (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
 and monitoring, and more) as well as the specialization aspects of 
 such execution environments (open source and proprietary). As an example, the 
 inability to gain access to such features is starting to affect Spark's 
 viability in large-scale, batch and/or ETL applications. 
 Introducing a pluggable architecture would solve both of the issues mentioned 
 above, ultimately benefiting Spark's technology and community by allowing it 
 to venture into co-existence and collaboration with a variety of existing Big 
 Data platforms as well as the ones yet to come to market.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - as a non-public api (@DeveloperAPI).
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 _execution-context:foo.bar.MyJobExecutionContext_, with the default 
 implementation containing the existing code from SparkContext, thus allowing 
 the current (corresponding) methods of SparkContext to delegate to that 
 implementation, ensuring binary and source compatibility with older versions of Spark. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext.
 Please see the attached design doc and pull request for more details.
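
A minimal sketch of the delegation being described follows, under the assumption 
that the default implementation simply routes each operation back to the existing 
SparkContext/RDD behavior of the Spark 1.1/1.2-era API (which is what would keep 
current applications source- and binary-compatible). It implements the later 
6-operation form of the trait sketched near the top of this thread and is not the 
actual patch.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

import scala.reflect.ClassTag

// Assumed shape of the default implementation: every operation delegates to
// the code that already lives in SparkContext/RDD today. Illustrative only.
class DefaultExecutionContext extends JobExecutionContext {

  def hadoopFile[K, V](sc: SparkContext, path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K], valueClass: Class[V], minPartitions: Int): RDD[(K, V)] =
    sc.hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](sc: SparkContext,
      path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V],
      conf: Configuration): RDD[(K, V)] =
    sc.newAPIHadoopFile(path, fClass, kClass, vClass, conf)

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T] =
    sc.broadcast(value)

  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int], resultHandler: (Int, U) => Unit): Unit =
    sc.runJob(rdd, func, partitions, allowLocal = false, resultHandler)

  def persist[T](sc: SparkContext, rdd: RDD[T], level: StorageLevel): RDD[T] =
    rdd.persist(level)

  def unpersist[T](sc: SparkContext, rdd: RDD[T], blocking: Boolean): RDD[T] =
    rdd.unpersist(blocking)
}
{code}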






[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring, and more) as well as the specialization aspects of 
such execution environments (open source and proprietary). As an example, the 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of Spark. 
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring, and more) as well as the specialization aspects of 
such execution environments (open source and proprietary). As an example, the 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of Spark. 
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 

[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring, and more) as well as the specialization aspects of 
such execution environments (open source and proprietary). As an example, the 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of Spark. 
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring, and more) as well as the specialization aspects of 
such execution environments (open source and proprietary). As an example, the 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of Spark. 
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 

[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark. 
The trait will define only 4 operations: 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
containing the existing code from SparkContext, thus allowing the current 
(corresponding) methods of SparkContext to delegate to that implementation. An 
integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext. 

Please see the attached design doc for more details. 
Pull Request will be posted shortly as well

  was:
Currently Spark _integrates with external resource-managing platforms_ such as 
Apache Hadoop YARN and Mesos to facilitate 
execution of Spark DAG in a distributed environment provided by those 
platforms. 

However, this integration is tightly coupled within Spark's implementation 
making it rather difficult to introduce integration points with 
other resource-managing platforms without constant modifications to Spark's 
core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party 
**DAG-like** and **DAG-capable** execution environments_ native 
to those platforms, thus limiting access to some of their native features 
(e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management 
and monitoring, and more) as well as the specialization aspects of 
such execution environments (open source and proprietary). As an example, the 
inability to gain access to such features is starting to affect Spark's 
viability in large-scale, batch and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned 
above, ultimately benefiting Spark's technology and community by allowing it to 
venture into co-existence and collaboration with a variety of existing Big Data 
platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
_execution-context:foo.bar.MyJobExecutionContext_, with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to that 
implementation, ensuring binary and source compatibility with older versions of Spark. 
An integrator will now have the option to provide a custom implementation of 
JobExecutionContext, either implementing it from scratch or extending from 
DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext 

[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Summary: Native Hadoop/YARN integration for batch/ETL workloads.  (was: 
Expose pluggable architecture to facilitate native integration with third-party 
execution environments.)

 Native Hadoop/YARN integration for batch/ETL workloads.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method maps directly to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to that implementation. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either implementing it from scratch or extending from 
 DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well






[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159593#comment-14159593
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

After giving it some thought, I am changing the title and the description back 
to the original, as it would be more appropriate to discuss whatever questions 
anyone may have via the comments.

 Native Hadoop/YARN integration for batch/ETL workloads.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark. 
 The trait will define only 4 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext. 
 Please see the attached design doc for more details. 
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

[~sandyr]

Indeed, YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments currently run 
on YARN. (NOTE: Custom execution environments on YARN are becoming very common 
in large enterprises.) Such decoupling will ensure that Spark can integrate 
with any and all of them (where applicable) in a pluggable and extensible 
fashion. 

 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 10/3/14 10:34 PM:
---

[~sandyr]

Indeed, YARN is a _resource manager_ that supports multiple execution 
environments by facilitating resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments currently run 
on YARN. (NOTE: Custom execution environments on YARN are becoming very common 
in large enterprises.) Such decoupling will ensure that Spark can integrate 
with any and all of them (where applicable) in a pluggable and extensible 
fashion. 


was (Author: ozhurakousky):
[~sandyr]

Indeed, YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments currently run 
on YARN. (NOTE: Custom execution environments on YARN are becoming very common 
in large enterprises.) Such decoupling will ensure that Spark can integrate 
with any and all of them (where applicable) in a pluggable and extensible 
fashion. 

 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-26 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149148#comment-14149148
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Thank you for the interest, guys. We are working on a prototype, which we will 
publish soon. 

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, thanks for following up.

Indeed, Spark does provide first-class extensibility mechanisms at many different 
levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing 
a crucial one, and that is the “execution context”. And while SparkContext 
itself could easily be extended or mixed in with a custom trait to achieve such 
customization, that is a less than ideal extension mechanism, since it would 
require code modification every time a user wants to swap an execution 
environment (e.g., from “local” in testing to “yarn” in prod). 
In fact, Spark already supports an externally configurable model where the 
target execution environment is managed through the “master URL”. However, the 
_nature_, _implementation_ and, most importantly, _customization_ of these 
environments are internal to Spark: 
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you'd agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is added. 
So essentially what we're proposing is to formalize what has always been 
supported by Spark into an externally configurable model, so that customization 
around the _*native functionality*_ of the target execution environment can be 
handled in a flexible and pluggable way.

In this model we are simply proposing a variation of the “chain of 
responsibility” pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation we've identified 4 core operations, which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job - “runJob”.

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.
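
To make the delegation above concrete, here is a rough, purely illustrative 
sketch of how the execution-context: master URL could be resolved. The prefix 
and the class-name form come from this proposal; the resolver object, its use 
of Option, and reflection via a no-arg constructor are assumptions of this 
example, not the code in the pull request (it also assumes the 
JobExecutionContext trait sketched earlier in the thread).
{code}
object ExecutionContextResolver {

  // Master URLs of the form "execution-context:<fully.qualified.ClassName>".
  private val ExecutionContextUrl = """execution-context:(.+)""".r

  def resolve(master: String): Option[JobExecutionContext] = master match {
    case ExecutionContextUrl(className) =>
      // Reflectively load the integrator-provided implementation.
      Some(Class.forName(className).newInstance().asInstanceOf[JobExecutionContext])
    case _ =>
      // Anything else ("local", "yarn-client", mesos://..., etc.) falls through
      // to SparkContext's existing handling, i.e. the default behavior.
      None
  }
}
{code}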



 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 2:55 PM:
--

Patrick, thanks for following up.

Indeed, Spark does provide first-class extensibility mechanisms at many different 
levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing 
a crucial one, and that is the “execution context”. And while SparkContext 
itself could easily be extended or mixed in with a custom trait to achieve such 
customization, that is a less than ideal extension mechanism, since it would 
require code modification every time a user wants to swap an execution 
environment (e.g., from “local” in testing to “yarn” in prod) if such an 
environment is not supported.
In fact, Spark already supports an externally configurable model where the 
target execution environment is managed through the “master URL”. However, the 
_nature_, _implementation_ and, most importantly, _customization_ of these 
environments are internal to Spark: 
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you'd agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is added. 
So essentially what we're proposing is to formalize what has always been 
supported by Spark into an externally configurable model, so that customization 
around the _*native functionality*_ of the target execution environment can be 
handled in a flexible and pluggable way.

In this model we are simply proposing a variation of the “chain of 
responsibility” pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation we've identified 4 core operations, which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job - “runJob”.

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.




was (Author: ozhurakousky):
Patrick, thanks for following up.

Indeed, Spark does provide first-class extensibility mechanisms at many different 
levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing 
a crucial one, and that is the “execution context”. And while SparkContext 
itself could easily be extended or mixed in with a custom trait to achieve such 
customization, that is a less than ideal extension mechanism, since it would 
require code modification every time a user wants to swap an execution 
environment (e.g., from “local” in testing to “yarn” in prod). 
In fact, Spark already supports an externally configurable model where the 
target execution environment is managed through the “master URL”. However, the 
_nature_, _implementation_ and, most importantly, _customization_ of these 
environments are internal to Spark: 
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you'd agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is added. 
So essentially what we're proposing is to formalize what has always been 
supported by Spark into an externally configurable model, so that customization 
around the _*native functionality*_ of the target execution environment can be 
handled in a flexible and pluggable way.

In this model we are simply proposing a variation of the “chain of 
responsibility” pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation we've identified 4 core operations, which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job - “runJob”.

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.



 Native Hadoop/YARN integration for batch/ETL workloads
 

[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, sorry, I feel I missed the core emphasis of what we are trying to 
accomplish with this.
Our main goal is to expose Spark to native Hadoop features (e.g., stateless 
YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities, such as 
interactive, in-memory and streaming workloads, to batch and ETL in shared, 
multi-tenant environments, and thereby benefiting the Spark community 
considerably by allowing Spark to be applied to all use cases and capabilities 
on Hadoop.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 3:24 PM:
--

Patrick, sorry, I feel I missed the core emphasis of what we are trying to 
accomplish with this.
As described in the attached design document, our main goal is to expose Spark 
to native Hadoop features (e.g., stateless YARN shuffle, Tez, etc.), thus 
extending Spark's existing capabilities, such as interactive, in-memory and 
streaming workloads, to batch and ETL in shared, multi-tenant environments, and 
thereby benefiting the Spark community considerably by allowing Spark to be 
applied to all use cases and capabilities on Hadoop.


was (Author: ozhurakousky):
Patrick, sorry, I feel I missed the core emphasis of what we are trying to 
accomplish with this.
Our main goal is to expose Spark to native Hadoop features (e.g., stateless 
YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities, such as 
interactive, in-memory and streaming workloads, to batch and ETL in shared, 
multi-tenant environments, and thereby benefiting the Spark community 
considerably by allowing Spark to be applied to all use cases and capabilities 
on Hadoop.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-16 Thread Oleg Zhurakousky (JIRA)
Oleg Zhurakousky created SPARK-3561:
---

 Summary: Native Hadoop/YARN integration for batch/ETL workloads
 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
 Fix For: 1.2.0


Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of 
SparkContext. HadoopExecutionContext implementation will be accessed by 
SparkContext via “spark.hadoop.execution.context” property with default 
implementation containing the existing code from SparkContext, thus allowing 
current (corresponding) methods of SparkContext to delegate to such 
implementation. An integrator will now have an option to provide custom 
implementation of HadoopExecutionContext by either implementing it from scratch 
or extending from DefaultHadoopExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well
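
As a usage illustration only: the sketch below assumes the 
"spark.hadoop.execution.context" property named above is read from SparkConf; 
the application name and the implementation class foo.bar.MyHadoopExecutionContext 
are hypothetical.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object EtlJobSetup {
  def main(args: Array[String]): Unit = {
    // Point the proposed property at an integrator-supplied implementation.
    val conf = new SparkConf()
      .setAppName("batch-etl-job")
      .setMaster("yarn-client")
      .set("spark.hadoop.execution.context", "foo.bar.MyHadoopExecutionContext")

    val sc = new SparkContext(conf)
    // Jobs are then built and run as usual; per the proposal, SparkContext
    // would internally delegate hadoopFile, newAPIHadoopFile, broadcast and
    // runJob to the configured execution context.
    sc.stop()
  }
}
{code}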



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-16 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Attachment: Spark_3561.pdf

Detailed design document

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: Spark_3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. HadoopExecutionContext implementation will be accessed by 
 SparkContext via “spark.hadoop.execution.context” property with default 
 implementation containing the existing code from SparkContext, thus allowing 
 current (corresponding) methods of SparkContext to delegate to such 
 implementation. An integrator will now have an option to provide custom 
 implementation of HadoopExecutionContext by either implementing it from 
 scratch or extending from DefaultHadoopExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-16 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

** Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. HadoopExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
HadoopExecutionContext by either implementing it from scratch or extending from 
DefaultHadoopExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

**Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. HadoopExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
HadoopExecutionContext by either implementing it from scratch or extending from 
DefaultHadoopExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: Spark_3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 ** Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. HadoopExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 HadoopExecutionContext by either implementing it from scratch or extending 
 from DefaultHadoopExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by 

[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-16 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Description: 
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultHadoopExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: Spark_3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-16 Thread Oleg Zhurakousky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Zhurakousky updated SPARK-3561:

Attachment: (was: Spark_3561.pdf)

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org