[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads.
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Summary: Native Hadoop/YARN integration for batch/ETL workloads.  (was: Expose pluggable architecture to facilitate native integration with third-party execution environments.)

> Native Hadoop/YARN integration for batch/ETL workloads.
> ---
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Oleg Zhurakousky
> Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the
> current architecture of Spark-on-YARN can be enhanced to provide
> significantly better utilization of cluster resources for large scale, batch
> and/or ETL applications when run alongside other applications (Spark and
> others) and services in YARN.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait)
> - a gateway and a delegate to Hadoop execution environment - as a non-public
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of
> SparkContext. JobExecutionContext implementation will be accessed by
> SparkContext via master URL as
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation
> containing the existing code from SparkContext, thus allowing current
> (corresponding) methods of SparkContext to delegate to such implementation.
> An integrator will now have an option to provide custom implementation of
> DefaultExecutionContext by either implementing it from scratch or extending
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
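The delegation scheme described in the proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the design doc or the pull request: it is written in Java so that it compiles without Spark on the classpath (the proposal itself describes a Scala trait), the four operation names and the "execution-context:" master URL prefix come from the issue text, while the simplified signatures, `MiniSparkContext` and `TracingExecutionContext` are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class ExecutionContextSketch {

    // Simplified stand-in for the proposed trait. The four operation names come
    // from the proposal; the signatures are assumptions made for illustration.
    interface JobExecutionContext {
        List<String> hadoopFile(String path);
        List<String> newAPIHadoopFile(String path);
        <T> T broadcast(T value);
        <T, R> List<R> runJob(List<T> data, Function<T, R> func);
    }

    // Stand-in for the default implementation, which per the proposal would
    // contain the code that currently lives in SparkContext.
    static class DefaultExecutionContext implements JobExecutionContext {
        public List<String> hadoopFile(String path) {
            return Arrays.asList("old-api record from " + path);
        }
        public List<String> newAPIHadoopFile(String path) {
            return Arrays.asList("new-api record from " + path);
        }
        public <T> T broadcast(T value) {
            return value; // a real implementation would distribute the value
        }
        public <T, R> List<R> runJob(List<T> data, Function<T, R> func) {
            List<R> results = new ArrayList<>();
            for (T item : data) {
                results.add(func.apply(item));
            }
            return results;
        }
    }

    // Hypothetical integrator-supplied context: extends the default and
    // overrides only runJob, counting how many jobs were delegated to it.
    static class TracingExecutionContext extends DefaultExecutionContext {
        int jobsRun = 0;
        @Override
        public <T, R> List<R> runJob(List<T> data, Function<T, R> func) {
            jobsRun++;
            return super.runJob(data, func);
        }
    }

    // Hypothetical stand-in for SparkContext: picks the execution context from
    // the master URL and delegates the corresponding method to it.
    static class MiniSparkContext {
        static final String PREFIX = "execution-context:";
        final JobExecutionContext executionContext;

        MiniSparkContext(String master) {
            this.executionContext = resolve(master);
        }

        static JobExecutionContext resolve(String master) {
            if (!master.startsWith(PREFIX)) {
                return new DefaultExecutionContext();
            }
            String className = master.substring(PREFIX.length());
            try {
                // Load the named class and call its no-argument constructor.
                return (JobExecutionContext) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(
                        "Cannot load execution context: " + className, e);
            }
        }

        <T, R> List<R> runJob(List<T> data, Function<T, R> func) {
            return executionContext.runJob(data, func);
        }
    }

    public static void main(String[] args) {
        String master = MiniSparkContext.PREFIX
                + TracingExecutionContext.class.getName();
        MiniSparkContext sc = new MiniSparkContext(master);
        List<Integer> lengths = sc.runJob(Arrays.asList("a", "bb", "ccc"), String::length);
        System.out.println(lengths); // prints [1, 2, 3]
    }
}
```

Any master URL without the prefix falls back to the default context in this sketch, which mirrors the stated intent that existing SparkContext behaviour stays unchanged unless an integrator opts in.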
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Attachment: SPARK-3561.pdf
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Attachment:     (was: Spark_3561.pdf)
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Description:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Description:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Description:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

**Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via “spark.hadoop.execution.context” property with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Description:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

** Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Description:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

** Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

**Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method directly maps to the corresponding methods in current version of SparkContext. HadoopExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of HadoopExecutionContext by either implementing it from scratch or extending from DefaultHadoopExecutionContext.
Please see the attached design doc for more details.
Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
    Attachment: Spark_3561.pdf

Detailed design document

> Native Hadoop/YARN integration for batch/ETL workloads
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
> Issue Type: New Feature
> Components: core
> Affects Versions: 1.1.0
> Reporter: Oleg Zhurakousky
> Labels: features
> Fix For: 1.2.0
>
> Attachments: Spark_3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the
> current architecture of Spark-on-YARN can be enhanced to provide
> significantly better utilization of cluster resources for large scale, batch
> and/or ETL applications when run alongside other applications (Spark and
> others) and services in YARN.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait)
> - a gateway and a delegate to Hadoop execution environment - as a non-public
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of
> SparkContext. HadoopExecutionContext implementation will be accessed by
> SparkContext via “spark.hadoop.execution.context” property with default
> implementation containing the existing code from SparkContext, thus allowing
> current (corresponding) methods of SparkContext to delegate to such
> implementation. An integrator will now have an option to provide custom
> implementation of HadoopExecutionContext by either implementing it from
> scratch or extending from DefaultHadoopExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well
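The property-based lookup mentioned in this revision of the description (the “spark.hadoop.execution.context” property) could look roughly like the sketch below. It is a minimal illustration under assumptions, not the attached design: it is written in Java so it runs without Spark, a plain `java.util.Properties` object stands in for Spark's configuration, the one-method interface is a placeholder for the four-operation trait, and `CustomHadoopExecutionContext` is a hypothetical integrator class. Only the property key and the names HadoopExecutionContext and DefaultHadoopExecutionContext come from the issue text.

```java
import java.util.Properties;

class PropertyLookupSketch {
    // Property key quoted in the issue description.
    static final String KEY = "spark.hadoop.execution.context";

    // Placeholder for the proposed trait; the real one would declare
    // hadoopFile, newAPIHadoopFile, broadcast and runJob.
    interface HadoopExecutionContext {
        String name();
    }

    // Stand-in for the default implementation.
    static class DefaultHadoopExecutionContext implements HadoopExecutionContext {
        public String name() {
            return "default";
        }
    }

    // Hypothetical integrator class that extends the default.
    static class CustomHadoopExecutionContext extends DefaultHadoopExecutionContext {
        @Override
        public String name() {
            return "custom";
        }
    }

    // Reads the class name from the property, falling back to the default
    // implementation when the property is not set.
    static HadoopExecutionContext fromConf(Properties conf) {
        String className = conf.getProperty(KEY, DefaultHadoopExecutionContext.class.getName());
        try {
            return (HadoopExecutionContext) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot load execution context: " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(fromConf(conf).name()); // prints default
        conf.setProperty(KEY, CustomHadoopExecutionContext.class.getName());
        System.out.println(fromConf(conf).name()); // prints custom
    }
}
```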