[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

Oleg Zhurakousky (JIRA) Wed, 17 Sep 2014 07:37:33 -0700

    [ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137326#comment-14137326 ]


Oleg Zhurakousky commented on SPARK-3561:
-----------------------------------------

Patrick, thanks for following up.

Indeed, Spark does provide first-class extensibility mechanisms at many 
different levels (shuffle, RDD, readers/writers, etc.). However, we believe it 
is missing a crucial one: the "execution context". While SparkContext itself 
could easily be extended or mixed in with a custom trait to achieve such 
customization, that is a less than ideal extension mechanism, since it would 
require a code modification every time a user wants to swap the execution 
environment (e.g., from "local" in testing to "yarn" in prod). 
In fact, Spark already supports an externally configurable model in which the 
target execution environment is selected through the "master" URL. However, the 
_nature_, _implementation_ and, most importantly, _customization_ of these 
environments are internal to Spark: 
{code}
master match {
      case "yarn-client" =>
      case mesosUrl @ MESOS_REGEX(_) =>
      . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modifying the above _case_ statement, which 
I am sure you'd agree is a less than ideal integration style, since it would 
require a new release of Spark every time a new _case_ is added. 
So essentially we are proposing to formalize what Spark has always supported 
into an externally configurable model, so that customization around the 
_*native functionality*_ of the target execution environment can be handled 
in a flexible and pluggable way.
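
To make the idea concrete, here is a minimal, hypothetical sketch of how a 
master URL could select a pluggable implementation by class name instead of 
through a hard-coded _case_ statement. Only the "execution-context:" prefix 
comes from the issue description; the resolver object, the stripped-down trait 
and the class names are assumptions for illustration and are not Spark code.

```scala
// Hypothetical sketch, not Spark's actual code. Only the "execution-context:"
// prefix is taken from the issue description; everything else is assumed.
trait JobExecutionContext {
  def name: String
}

// Stands in for the behaviour that exists in Spark today.
class DefaultExecutionContext extends JobExecutionContext {
  def name: String = "default"
}

// An example of an integrator-supplied implementation.
class MyJobExecutionContext extends DefaultExecutionContext {
  override def name: String = "custom"
}

object ExecutionContextResolver {
  private val Prefix = "execution-context:"

  // Resolve the context from the master URL. A class name after the prefix is
  // instantiated reflectively through its no-argument constructor, so adding
  // a new environment needs no change to this method.
  def resolve(master: String): JobExecutionContext =
    if (master.startsWith(Prefix)) {
      val className = master.stripPrefix(Prefix)
      Class.forName(className)
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[JobExecutionContext]
    } else {
      new DefaultExecutionContext
    }
}
```

With this shape, a master URL without the prefix keeps the default behaviour, 
while one carrying the prefix followed by a fully qualified class name loads 
that class from the classpath without a new release of the host framework.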

In this model we are simply proposing a variation of the "chain of 
responsibility" pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation, we have identified 4 core operations, which you 
can see in _JobExecutionContext_. 
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job: "runJob".
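
As a rough illustration of that delegation, the sketch below models the four 
operations with deliberately simplified signatures. SimpleRDD, SimpleBroadcast, 
SimpleContext, CountingExecutionContext and the in-memory "files" map are 
hypothetical stand-ins and not Spark types; the real methods in SparkContext 
take more parameters than shown here.

```scala
// Simplified stand-ins, assumed for illustration; not Spark's RDD or Broadcast.
final case class SimpleRDD[T](partitions: Seq[Seq[T]])
final case class SimpleBroadcast[T](value: T)

// The four operations named in the proposal; these signatures are assumptions.
trait JobExecutionContext {
  def hadoopFile(path: String): SimpleRDD[String]
  def newAPIHadoopFile(path: String): SimpleRDD[String]
  def broadcast[T](value: T): SimpleBroadcast[T]
  def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U]
}

// Default behaviour. An in-memory map plays the role of the file system.
class DefaultExecutionContext(files: Map[String, Seq[String]])
    extends JobExecutionContext {
  def hadoopFile(path: String): SimpleRDD[String] =
    SimpleRDD(Seq(files.getOrElse(path, Seq.empty)))
  def newAPIHadoopFile(path: String): SimpleRDD[String] = hadoopFile(path)
  def broadcast[T](value: T): SimpleBroadcast[T] = SimpleBroadcast(value)
  // Apply the function to each partition and collect one result per partition.
  def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U] =
    rdd.partitions.map(p => func(p.iterator))
}

// A custom context that reuses the defaults and only intercepts runJob.
class CountingExecutionContext(files: Map[String, Seq[String]])
    extends DefaultExecutionContext(files) {
  var jobsRun = 0
  override def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U] = {
    jobsRun += 1
    super.runJob(rdd, func)
  }
}

// The user-facing context simply forwards the four calls to its delegate.
class SimpleContext(ctx: JobExecutionContext) {
  def hadoopFile(path: String): SimpleRDD[String] = ctx.hadoopFile(path)
  def newAPIHadoopFile(path: String): SimpleRDD[String] = ctx.newAPIHadoopFile(path)
  def broadcast[T](value: T): SimpleBroadcast[T] = ctx.broadcast(value)
  def runJob[T, U](rdd: SimpleRDD[T], func: Iterator[T] => U): Seq[U] =
    ctx.runJob(rdd, func)
}

// An end-user program: unaware of which execution context is behind the API.
object UserProgram {
  def countLines(sc: SimpleContext, path: String): Int =
    sc.runJob(sc.hadoopFile(path), (it: Iterator[String]) => it.size).sum
}
```

The point of the sketch is that UserProgram is written once against the 
user-facing context and yields the same result whichever execution context is 
plugged in behind it.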

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.



> Native Hadoop/YARN integration for batch/ETL workloads
> ------------------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>
>         Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide a custom implementation of 
> JobExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> A pull request will be posted shortly as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
