[https://issues.apache.org/jira/browse/SPARK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581305#comment-14581305]
Marcelo Vanzin commented on SPARK-8302:
---------------------------------------
So I spent some time on this and tried two different approaches.
* Approach #1: make configuration more dynamic.
This means that you could set up your local Spark configuration to reference
variables that would be replaced when processes are actually launched. So, to
use the example above, you'd set {{SPARK_DIST_CLASSPATH}} to something like
{{${env.HADOOP_INSTALL_DIR}/lib/*}}, and each backend would expand that value
when launching processes.
It turned out this option required way too many code changes: the launcher
library, the standalone backend, the Mesos backend, and finally YARN would all
need to know which configs may contain variables and how to replace them with
the right values. Since the feature is not really useful for anything but YARN,
I decided to try a different approach.
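To make the expansion idea concrete, here's a minimal sketch of what a
{{${env.NAME}}} substitution helper could look like; {{ConfExpansion}} and
{{expandEnvVars}} are hypothetical names, not actual Spark APIs.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: expand ${env.NAME} tokens in a config value using
 * a given environment map. Not actual Spark code.
 */
public class ConfExpansion {
  // Matches tokens of the form ${env.SOME_VAR}.
  private static final Pattern ENV_VAR =
      Pattern.compile("\\$\\{env\\.([A-Za-z_][A-Za-z0-9_]*)\\}");

  public static String expandEnvVars(String value, Map<String, String> env) {
    Matcher m = ENV_VAR.matcher(value);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      // Unknown variables are left as-is rather than expanded to nothing.
      String replacement = env.getOrDefault(m.group(1), m.group(0));
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```

The hard part isn't this helper; it's that every launch path (launcher,
standalone, Mesos, YARN) would need to agree on which configs get this
treatment and on the environment used for expansion.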
* Approach #2: perform some targeted path replacement in the YARN client code
This was my attempt to restrict changes to the minimum needed to cover the
YARN case. I ended up with some simple code that allows paths to be replaced
with an alternative in any variables that affect a remote process's command
line or environment (such as classpaths and native library paths).
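In spirit, the replacement is just a prefix rewrite over classpath-style
entries before they're shipped to the remote container; the names below
({{PathReplacement}}, {{replacePaths}}) are illustrative, not the actual code
in the PR.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: rewrite entries that start with the gateway-local
 * path so they use the cluster-side path instead. Not actual Spark code.
 */
public class PathReplacement {
  public static List<String> replacePaths(
      List<String> entries, String gatewayPath, String clusterPath) {
    return entries.stream()
        .map(e -> e.startsWith(gatewayPath)
            ? clusterPath + e.substring(gatewayPath.length())
            : e)
        .collect(Collectors.toList());
  }
}
```

So a gateway-local entry like {{/disk1/hadoop/lib/*}} would be rewritten to
{{/disk2/hadoop/lib/*}} in the container's command line, while unrelated
entries pass through untouched.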
I'll send a PR for #2 shortly; I hope that clears up what I mean by the above.
> Support heterogeneous cluster nodes on YARN
> -------------------------------------------
>
> Key: SPARK-8302
> URL: https://issues.apache.org/jira/browse/SPARK-8302
> Project: Spark
> Issue Type: New Feature
> Components: YARN
> Affects Versions: 1.5.0
> Reporter: Marcelo Vanzin
>
> Some of our customers install Hadoop on different paths across the cluster.
> When running a Spark app, this leads to a few complications because of how we
> try to reuse the rest of Hadoop.
> Since all configuration for a Spark-on-YARN application is local, the code
> does not have enough information about how to run things on the rest of the
> cluster in such cases.
> To illustrate: let's say that a node's configuration says that
> {{SPARK_DIST_CLASSPATH=/disk1/hadoop/lib/*}}. If I launch a Spark app from
> that machine, but there's a machine on the cluster where Hadoop is actually
> installed in {{/disk2/hadoop/lib}}, then any container launched on that node
> will fail.
> The problem does not exist (or is much less pronounced) on standalone and
> Mesos, since both require a local Spark installation and configuration.
> It would be nice if we could easily support this use case on YARN.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)