[jira] [Created] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

Oleg Zhurakousky (JIRA) Tue, 16 Sep 2014 19:50:51 -0700

Oleg Zhurakousky created SPARK-3561:
---------------------------------------


             Summary: Native Hadoop/YARN integration for batch/ETL workloads
                 Key: SPARK-3561
                 URL: https://issues.apache.org/jira/browse/SPARK-3561
             Project: Spark
          Issue Type: New Feature
          Components: core
    Affects Versions: 1.1.0
            Reporter: Oleg Zhurakousky
             Fix For: 1.2.0


Currently Spark provides integration with external resource managers such as
Apache Hadoop YARN and Mesos. In the context of YARN specifically, the current
Spark-on-YARN architecture can be enhanced to make significantly better use of
cluster resources for large-scale batch and/or ETL applications when they run
alongside other applications (Spark and others) and services in YARN.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) -
a gateway and a delegate to the Hadoop execution environment - as a non-public
API (@DeveloperApi) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
Each method maps directly to the corresponding method in the current version of
SparkContext. The HadoopExecutionContext implementation will be selected by
SparkContext via the “spark.hadoop.execution.context” property. The default
implementation will contain the existing code from SparkContext, so the
current (corresponding) methods of SparkContext can simply delegate to it.
An integrator will then have the option to provide a custom implementation of
HadoopExecutionContext, either by implementing it from scratch or by extending
DefaultHadoopExecutionContext.
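
To make the shape of the proposal concrete, here is a minimal sketch in Scala.
It is an illustration under stated assumptions, not the code from the design
doc or the pull request: SimpleRDD, SimpleBroadcast, LoggingExecutionContext
and ExecutionContextLoader are hypothetical stand-ins, the method signatures
are simplified, and a plain Map takes the place of the Spark configuration.
The text above uses both JobExecutionContext and HadoopExecutionContext; the
sketch uses the former for the trait and DefaultHadoopExecutionContext for the
default implementation.

```scala
// Stand-in types so the sketch compiles without Spark on the classpath.
// They are hypothetical simplifications, not Spark's real RDD or Broadcast.
final case class SimpleRDD[T](partitions: Seq[Seq[T]])
final case class SimpleBroadcast[T](value: T)

// The four operations named in the proposal, with simplified signatures.
trait JobExecutionContext {
  def hadoopFile(path: String): SimpleRDD[String]
  def newAPIHadoopFile(path: String): SimpleRDD[String]
  def broadcast[T](value: T): SimpleBroadcast[T]
  def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U]
}

// Stands in for the default implementation that would hold the code
// currently living in SparkContext.
class DefaultHadoopExecutionContext extends JobExecutionContext {
  def hadoopFile(path: String): SimpleRDD[String] =
    SimpleRDD(Seq(Seq(s"old-api:$path")))
  def newAPIHadoopFile(path: String): SimpleRDD[String] =
    SimpleRDD(Seq(Seq(s"new-api:$path")))
  def broadcast[T](value: T): SimpleBroadcast[T] = SimpleBroadcast(value)
  // Apply func to each partition and collect one result per partition.
  def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U] =
    rdd.partitions.map(p => func(p.iterator))
}

// A hypothetical integrator-supplied context that extends the default
// and overrides only runJob.
class LoggingExecutionContext extends DefaultHadoopExecutionContext {
  var jobsRun = 0
  override def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U] = {
    jobsRun += 1
    super.runJob(rdd, func)
  }
}

object ExecutionContextLoader {
  val PropertyKey = "spark.hadoop.execution.context"

  // Instantiate the class named by the property, or fall back to the default.
  // Assumes the configured class has a public no-argument constructor.
  def load(conf: Map[String, String]): JobExecutionContext =
    conf.get(PropertyKey) match {
      case Some(className) =>
        Class.forName(className)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[JobExecutionContext]
      case None => new DefaultHadoopExecutionContext
    }
}
```

With no property set, ExecutionContextLoader.load(Map.empty) returns the
default context. Setting the property to the fully qualified name of a class
such as LoggingExecutionContext swaps the implementation without any change to
the calling code, which is the delegation behaviour described above.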

Please see the attached design doc for more details.
A pull request will be posted shortly as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
