[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514062#comment-14514062 ] Oleg Zhurakousky commented on SPARK-3561:

Here is an interesting read that provides an even stronger case for separating flow construction from execution context: http://blog.acolyer.org/2015/04/27/musketeer-part-i-whats-the-best-data-processing-system/ and http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf

The key points are:

_It thus makes little sense to force the user to target a single system at workflow implementation time. Instead, we argue that users should, in principle, be able to execute their high-level workflow on any data processing system (§3). Being able to do this has three main benefits:_
_1. Users write their workflow once, in a way they choose, but can easily execute it on alternative systems;_
_2. Multiple sub-components of a workflow can be executed on different back-end systems; and_
_3. Existing workflows can easily be ported to new systems._

Allow for pluggable execution contexts in Spark
-----------------------------------------------

Key: SPARK-3561
URL: https://issues.apache.org/jira/browse/SPARK-3561
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
Labels: features
Attachments: SPARK-3561.pdf

Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark. The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details.
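For illustration, here is a minimal sketch of what the proposed trait might look like, assuming a package location of org.apache.spark and parameter lists that mirror the corresponding Spark 1.x SparkContext methods; only the operation names come from the proposal, the signatures are inferred and may differ from the actual pull request:

{code:scala}
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.{InputFormat => OldInputFormat}
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}

import org.apache.spark.annotation.Experimental
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: the six operation names are from the proposal; the
// parameter lists mirror SparkContext's Spark 1.x signatures and are
// not taken verbatim from the pull request.
@Experimental
private[spark] trait JobExecutionContext {

  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: OldInputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit

  def persist(sc: SparkContext, rdd: RDD[_], storageLevel: StorageLevel): RDD[_]

  def unpersist(sc: SparkContext, rdd: RDD[_], blocking: Boolean): RDD[_]
}
{code}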
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265113#comment-14265113 ] Oleg Zhurakousky commented on SPARK-3561:

Sorry for the delay in response, I'll just blame the holidays ;) No, I have not had a chance to run the elasticity tests against 1.2, so I am going to have to follow up on that.

The main motivation for this proposal is to _formalize an extension model around Spark's execution environment_ to allow other execution environments (new and existing) to be easily plugged in by a system integrator without requiring a new release of Spark (given that the current integration mechanism relies on a 'case' statement with hard-coded values). The reasons for _why this is necessary_ are many, but they can all be summarized around the old **_generalization_** vs. **_specialization_** argument. And while _Tez, elastic scaling, and utilization of cluster resources_ are all good examples and indeed were the initial motivators, they are certainly not the end. The current efforts of several of our clients, who are integrating Spark with their custom execution environments using the proposed approach, are good evidence of its viability and an obvious benefit to Spark's technology, allowing it to become a developer-friendly "face" of many execution environments/technologies while continuing innovation of its own.

So I think the next logical step would be to gather "for" and "against" arguments around a pluggable execution context for Spark in general; then we can discuss implementation.
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190337#comment-14190337 ] Oleg Zhurakousky commented on SPARK-3561:

Just as an FYI: the POC code for the Tez-based reference implementation of the aforementioned _execution context_ is available at https://github.com/hortonworks/spark-native-yarn, together with a samples project at https://github.com/hortonworks/spark-native-yarn-samples.
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186987#comment-14186987 ] Oleg Zhurakousky commented on SPARK-3561:

[~vanzin] I would not call it hard (we did it in the initial POC by simply mixing a custom trait into SC - essentially extending it). However, I do agree that a lot of Spark's initialization would still happen due to the implementation of SC itself, thus creating and initializing some artifacts that may never be used with a different execution context. Question: why was it done like this and not pushed into some SC.init operation?
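A hypothetical sketch of the mix-in approach mentioned in this comment; the trait name is invented, and the runJob signature is the Spark 1.x one:

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Invented name: a trait mixed into SparkContext that reroutes job
// execution. SparkContext's constructor still runs, so the DAGScheduler,
// UI, etc. are created even if this override never uses them -- the
// initialization concern raised in the comment above.
trait CustomExecution extends SparkContext {
  override def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit = {
    // Hand the job to an alternative engine instead of the DAGScheduler.
    ???
  }
}

// Mixed in at instantiation time:
//   val sc = new SparkContext(conf) with CustomExecution
{code}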
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176501#comment-14176501 ] Oleg Zhurakousky commented on SPARK-3561:

Resubmitting the pull request (squashed), which adds 2 methods to the _JobExecutionContext_ to support caching: https://github.com/apache/spark/pull/2849

The prototype will be published shortly as well.
[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Description:
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark. The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details. Pull Request will be posted shortly as well.

was:
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details. Pull Request will be posted shortly as well.
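For illustration, here is a sketch of how SparkContext might resolve an execution-context: master URL like the one described above. The helper object, its name, and the reflective instantiation (assuming no-arg constructors) are assumptions for this sketch, not code from the pull request; JobExecutionContext and DefaultExecutionContext are the trait and default implementation named in the proposal:

{code:scala}
package org.apache.spark

// Hypothetical helper; only the "execution-context:" URL scheme and the
// class names come from the proposal.
private[spark] object ExecutionContextResolver {
  private val Prefix = "execution-context:"

  def resolve(master: String): JobExecutionContext = {
    if (master.startsWith(Prefix)) {
      val className = master.stripPrefix(Prefix)
      try {
        // e.g. master = "execution-context:foo.bar.MyJobExecutionContext"
        Class.forName(className).newInstance().asInstanceOf[JobExecutionContext]
      } catch {
        case e: Exception =>
          throw new SparkException("Failed to instantiate execution context: " + className, e)
      }
    } else {
      // Any other master URL keeps the existing SparkContext behavior,
      // carried by the default implementation.
      new DefaultExecutionContext()
    }
  }
}
{code}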
[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Description:
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark. The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details.

was:
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark. The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details. Pull Request will be posted shortly as well.
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky commented on SPARK-3561:

Patrick, I think there is a misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core, and the existing pull request (https://github.com/apache/spark/pull/2422) clearly shows that. What I am proposing is to expose an integration point in Spark by extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but also making it very useful in unit testing, as it would provide a clear separation between the _assembly_ and _execution_ layers, allowing them to be tested in isolation. I think this feature would benefit Spark tremendously, particularly given how several folks have already expressed their interest in this feature/direction. I appreciate your help and advice in getting this contribution into Spark. Thanks!
[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky edited comment on SPARK-3561 at 10/10/14 12:10 PM:

Patrick, I think there is a misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core, and the existing pull request (https://github.com/apache/spark/pull/2422) clearly shows that. What I am proposing is to expose an integration point in Spark by extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution contexts, but also making it very useful in unit testing, as it would provide a clear separation between the _assembly_ and _execution_ layers, allowing them to be tested in isolation. I think this feature would benefit Spark tremendously, particularly given how several folks have already expressed their interest in this feature/direction. I appreciate your help and advice in getting this contribution into Spark. Thanks!

was (Author: ozhurakousky):
Patrick, I think there is a misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core, and the existing pull request (https://github.com/apache/spark/pull/2422) clearly shows that. What I am proposing is to expose an integration point in Spark by extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but also making it very useful in unit testing, as it would provide a clear separation between the _assembly_ and _execution_ layers, allowing them to be tested in isolation. I think this feature would benefit Spark tremendously, particularly given how several folks have already expressed their interest in this feature/direction. I appreciate your help and advice in getting this contribution into Spark. Thanks!
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162013#comment-14162013 ] Oleg Zhurakousky commented on SPARK-3561:

Patrick, your point about confusion with other JIRAs makes sense. Thanks. With regard to the detailed design, can you please let me know what you are looking for and what would be useful? The only change I'm proposing is the addition of an @Experimental interface with the 4 methods, for the reasons stated in the design doc. For example:
* Would it be useful if I sent another PR with the implementation of the interface?
* Would it be useful if I shared benchmarks which showcase some of the benefits of alternative execution for batch/ETL scenarios?

Since this is my first involvement in the Spark community, I appreciate your guidance and I'm happy to provide any details you might find useful. Thanks!
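For concreteness, a hypothetical integrator-side sketch of such an implementation; the class name and package are the placeholder from the proposal, and the assumption that DefaultExecutionContext lives in org.apache.spark and exposes an overridable runJob with this signature is purely illustrative:

{code:scala}
package foo.bar

import scala.reflect.ClassTag

import org.apache.spark.{SparkContext, TaskContext}
// Assumed location; the proposal does not specify a package for the
// default implementation.
import org.apache.spark.DefaultExecutionContext
import org.apache.spark.rdd.RDD

// Invented example: reuse the default behavior for everything except
// runJob, which is rerouted to an alternative engine.
class MyJobExecutionContext extends DefaultExecutionContext {
  override def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit = {
    // Submit the job to the alternative execution environment here.
    ???
  }
}

// Selected through the master URL, as described in the proposal:
//   val conf = new SparkConf().setMaster("execution-context:foo.bar.MyJobExecutionContext")
//   val sc = new SparkContext(conf)
{code}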
[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159559#comment-14159559 ] Oleg Zhurakousky commented on SPARK-3561:

Sandy, one other thing: while I understand the reasoning behind the changes to the title and the description of the JIRA, it would probably be better to coordinate this with the original submitter before making such changes in the future (similar to the way Patrick suggested in SPARK-3174). This would alleviate potential discrepancies in the overall message and intentions of the JIRA. Anyway, I've edited both the title and the description, taking your edits into consideration.
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Description:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as the specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale batch and/or ETL applications.

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI). The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.

Please see the attached design doc and pull request for more details.

was:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as the specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale batch and/or ETL applications.

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI). The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.

Please see the attached design doc and pull request for more details.
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Description:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as the specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale batch and/or ETL applications.

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI). The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of JobExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.

Please see the attached design doc and pull request for more details.

was:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as the specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale batch and/or ETL applications.

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI). The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.

Please see the attached design doc and pull request for more details.
[jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Description:
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will now have the option to provide a custom implementation by either implementing the trait from scratch or extending from DefaultExecutionContext.

Please see the attached design doc for more details. Pull Request will be posted shortly as well.

was:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of Spark DAGs in a distributed environment provided by those platforms. However, this integration is tightly coupled within Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as the specialization aspects of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale batch and/or ETL applications.

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as the ones yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI). The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be selected by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of JobExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.

Please see the attached design doc and pull request for more details.
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561:

Summary: Native Hadoop/YARN integration for batch/ETL workloads. (was: Expose pluggable architecture to facilitate native integration with third-party execution environments.)
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159593#comment-14159593 ] Oleg Zhurakousky commented on SPARK-3561:

After giving it some thought, I am changing the title and the description back to the original, as it would be more appropriate to discuss whatever questions anyone may have via the comments.
[jira] [Commented] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158614#comment-14158614 ] Oleg Zhurakousky commented on SPARK-3561:

[~sandyr] Indeed, YARN is a _resource manager_ that supports multiple execution environments by helping with resource allocation and management. On the other hand, Spark, Tez, and many other (custom) execution environments currently run on YARN. (NOTE: custom execution environments on YARN are becoming very common in large enterprises.) Such decoupling will ensure that Spark can integrate with any and all of them (where applicable) in a pluggable and extensible fashion.
[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614 ] Oleg Zhurakousky edited comment on SPARK-3561 at 10/3/14 10:34 PM: --- [~sandyr] Indeed YARN is a _resource manager_ that supports multiple execution environments by facilitating resource allocation and management. On the other hand, Spark, Tez and many other (custom) execution environments currently run on YARN. (NOTE: Custom execution environments on YARN are becoming very common in large enterprises.) Such decoupling will ensure that Spark can integrate with any and all of them (where applicable) in a pluggable and extensible fashion.
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149148#comment-14149148 ] Oleg Zhurakousky commented on SPARK-3561: - Thank you for the interest, guys. We are working on the prototype, which we will publish soon.
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, thanks for following up. Indeed Spark does provide first-class extensibility mechanisms at many different levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing a crucial one, and that is the "execution context". And while SparkContext itself could easily be extended or mixed in with a custom trait to achieve such customization, that is a less than ideal extension mechanism, since it would require code modification every time a user wants to swap an execution environment (e.g., from "local" in testing to "yarn" in prod). In fact, Spark already supports an externally configurable model where the target execution environment is managed through the master URL. However, the _nature_, _implementation_ and, most importantly, _customization_ of these environments are internal to Spark:
{code}
master match {
  case "yarn-client" => ...
  case mesosUrl @ MESOS_REGEX(_) => ...
  ...
}
{code}
Furthermore, any additional integration and/or customization work that may come in the future would require modification to the above _case_ statement, which I am sure you'd agree is a less than ideal integration style, since it would require a new release of Spark every time a new _case_ is added. So essentially what we're proposing is to formalize what has always been supported by Spark into an externally configurable model, so that customization around the _*native functionality*_ of the target execution environment can be handled in a flexible and pluggable way. In this model we are simply proposing a variation of the "chain of responsibility" pattern, where DAG execution could be delegated to an _execution context_ with no change to end-user programs or semantics. Based on our investigation we've identified 4 core operations, which you can see in _JobExecutionContext_. Two of them provide access to source RDD creation, thus allowing customization of data _sourcing_ (custom readers, direct block access, etc.). One is for _broadcast_, to integrate with natively provided broadcast capabilities. And last but not least is the main _execution delegate_ for the job - "runJob". And while I am sure there will be more questions, I hope the above response clarifies the overall intention of this proposal.
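To make the alternative being proposed here concrete, the following is a rough sketch of how an execution-context: master URL could be resolved reflectively, so that plugging in a new environment requires no new case branch and no new Spark release. It builds on the trait sketched earlier in the thread; the resolver object, its method name, and the default class name are hypothetical, while the URL scheme and the placeholder class name come from the proposal itself.
{code}
// Illustrative only; not the actual implementation from the pull request.
object ExecutionContextResolver {
  private val ExecutionContextUrl = """execution-context:(.+)""".r

  def resolve(master: String): JobExecutionContext = master match {
    // A custom context is named directly in the master URL, e.g.
    // "execution-context:foo.bar.MyJobExecutionContext", and is
    // instantiated reflectively.
    case ExecutionContextUrl(className) =>
      Class.forName(className)
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[JobExecutionContext]
    // Any other master URL falls through to the default context, which
    // per the proposal wraps the existing SparkContext behavior
    // (the class name here is hypothetical).
    case _ =>
      Class.forName("org.apache.spark.DefaultExecutionContext")
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[JobExecutionContext]
  }
}
{code}
Under this scheme the case statement above stays closed: new environments ship as ordinary jars on the classpath rather than as new branches in Spark core.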
[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326 ] Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 2:55 PM: -- Patrick, thanks for following up. Indeed Spark does provide first-class extensibility mechanisms at many different levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing a crucial one, and that is the "execution context". And while SparkContext itself could easily be extended or mixed in with a custom trait to achieve such customization, that is a less than ideal extension mechanism, since it would require code modification every time a user wants to swap an execution environment (e.g., from "local" in testing to "yarn" in prod) if such environment is not supported. In fact, Spark already supports an externally configurable model where the target execution environment is managed through the master URL. However, the _nature_, _implementation_ and, most importantly, _customization_ of these environments are internal to Spark:
{code}
master match {
  case "yarn-client" => ...
  case mesosUrl @ MESOS_REGEX(_) => ...
  ...
}
{code}
Furthermore, any additional integration and/or customization work that may come in the future would require modification to the above _case_ statement, which I am sure you'd agree is a less than ideal integration style, since it would require a new release of Spark every time a new _case_ is added. So essentially what we're proposing is to formalize what has always been supported by Spark into an externally configurable model, so that customization around the _*native functionality*_ of the target execution environment can be handled in a flexible and pluggable way. In this model we are simply proposing a variation of the "chain of responsibility" pattern, where DAG execution could be delegated to an _execution context_ with no change to end-user programs or semantics. Based on our investigation we've identified 4 core operations, which you can see in _JobExecutionContext_. Two of them provide access to source RDD creation, thus allowing customization of data _sourcing_ (custom readers, direct block access, etc.). One is for _broadcast_, to integrate with natively provided broadcast capabilities. And last but not least is the main _execution delegate_ for the job - "runJob". And while I am sure there will be more questions, I hope the above response clarifies the overall intention of this proposal.
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, sorry, I feel like I missed the core emphasis of what we are trying to accomplish with this. Our main goal is to expose Spark to native Hadoop features (i.e., stateless YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities (interactive, in-memory, streaming) to batch and ETL workloads in shared, multi-tenant environments. This would benefit the Spark community considerably by allowing Spark to be applied to all use cases and capabilities on and in Hadoop.
[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378 ] Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 3:24 PM: -- Patrick, sorry, I feel like I missed the core emphasis of what we are trying to accomplish with this. As described in the design document attached, our main goal is to expose Spark to native Hadoop features (i.e., stateless YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities (interactive, in-memory, streaming) to batch and ETL workloads in shared, multi-tenant environments. This would benefit the Spark community considerably by allowing Spark to be applied to all use cases and capabilities on and in Hadoop.
[jira] [Created] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
Oleg Zhurakousky created SPARK-3561: --- Summary: Native Hadoop/YARN integration for batch/ETL workloads Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Fix For: 1.2.0 Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch, and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The HadoopExecutionContext implementation will be accessed by SparkContext via the "spark.hadoop.execution.context" property, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of HadoopExecutionContext, either by implementing it from scratch or by extending from DefaultHadoopExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well.
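As a rough illustration of the property-based lookup described in this original revision (later revisions switched to the master URL scheme), the sketch below assumes the JobExecutionContext trait sketched earlier in the thread; the property name is quoted from this description, while the loader object and the default class name are hypothetical.
{code}
import org.apache.spark.SparkConf

object HadoopExecutionContextLoader {
  // Hypothetical default; per the proposal it would contain the
  // existing SparkContext code.
  private val DefaultContextClass =
    "org.apache.spark.DefaultHadoopExecutionContext"

  def load(conf: SparkConf): JobExecutionContext = {
    // Fall back to the default implementation when the property is unset.
    val className =
      conf.get("spark.hadoop.execution.context", DefaultContextClass)
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[JobExecutionContext]
  }
}
{code}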
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Attachment: Spark_3561.pdf Detailed design document
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Description: Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch, and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The HadoopExecutionContext implementation will be accessed by SparkContext via the master URL, as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of HadoopExecutionContext, either by implementing it from scratch or by extending from DefaultHadoopExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well.
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Description: Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch, and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via the master URL, as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom execution context, either implemented from scratch or extending DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well.
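To illustrate the integrator-facing side of the description above: a custom context would typically override only the operations it cares about and inherit the rest. This sketch assumes a concrete DefaultExecutionContext as named in the proposal, with the method signature guessed earlier in the thread; everything beyond those names is hypothetical.
{code}
import scala.reflect.ClassTag

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical integrator code: inherit hadoopFile, newAPIHadoopFile
// and broadcast from the default, and reroute only job execution.
class MyJobExecutionContext extends DefaultExecutionContext {
  override def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    // Hand the DAG off to a custom engine (e.g. a Tez-based runner)
    // instead of Spark's own DAGScheduler.
    ???
  }
}
{code}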
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Zhurakousky updated SPARK-3561: Attachment: (was: Spark_3561.pdf)
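Finally, a sketch of what selecting a pluggable context would look like from the end-user side under this proposal: the only change is the master URL, while program code and semantics stay the same. The execution-context: scheme and the placeholder class name are from the proposal (and only meaningful with the proposed patch applied); the application itself is an arbitrary word-count example.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object PluggableContextDemo {
  def main(args: Array[String]): Unit = {
    // Only the master URL differs from a stock Spark application.
    val conf = new SparkConf()
      .setAppName("pluggable-context-demo")
      .setMaster("execution-context:foo.bar.MyJobExecutionContext")
    val sc = new SparkContext(conf)

    // Ordinary RDD program; under the proposal, runJob and the file
    // reads below would be delegated to the configured context.
    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}
{code}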