[https://issues.apache.org/jira/browse/SPARK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581305#comment-14581305]
Marcelo Vanzin commented on SPARK-8302:
---------------------------------------
So I spent some time on this and tried two different approaches.
* Approach #1: make configuration more dynamic.
This means that you could set up your local Spark configuration to reference
variables that would be replaced when processes are actually launched. So, to
use the example above, you'd set {{SPARK_DIST_CLASSPATH}} to something like
{{${env.HADOOP_INSTALL_DIR}/lib/*}}, and each backend would expand that value
when launching processes.
It turned out this option required way too many code changes: the launcher
library, the standalone backend, the Mesos backend, and finally YARN would all
need to know which configs may contain variables and how to replace them with
the right values. Since the feature is not really useful for anything but YARN,
I decided to try a different approach.
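To make the expansion idea concrete, here's a minimal sketch of what a
{{${env.NAME}}} substitution helper could look like; {{ConfExpansion}} and
{{expandEnvVars}} are hypothetical names, not actual Spark APIs.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: expand ${env.NAME} tokens in a config value using
 * a given environment map. Not actual Spark code.
 */
public class ConfExpansion {
  // Matches tokens of the form ${env.SOME_VAR}.
  private static final Pattern ENV_VAR =
      Pattern.compile("\\$\\{env\\.([A-Za-z_][A-Za-z0-9_]*)\\}");

  public static String expandEnvVars(String value, Map<String, String> env) {
    Matcher m = ENV_VAR.matcher(value);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      // Unknown variables are left as-is rather than expanded to nothing.
      String replacement = env.getOrDefault(m.group(1), m.group(0));
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```

The hard part isn't this helper; it's that every launch path (launcher,
standalone, Mesos, YARN) would need to agree on which configs get this
treatment and on the environment used for expansion.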
* Approach #2: perform some targeted path replacement in the YARN client code
This was my attempt to restrict changes to the minimum needed to cover the
YARN case. I ended up with some simple code that allows paths to be replaced
with an alternative in any variables that affect a remote process's command
line or environment (such as classpaths and native library paths).
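In spirit, the replacement is just a prefix rewrite over classpath-style
entries before they're shipped to the remote container; the names below
({{PathReplacement}}, {{replacePaths}}) are illustrative, not the actual code
in the PR.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: rewrite entries that start with the gateway-local
 * path so they use the cluster-side path instead. Not actual Spark code.
 */
public class PathReplacement {
  public static List<String> replacePaths(
      List<String> entries, String gatewayPath, String clusterPath) {
    return entries.stream()
        .map(e -> e.startsWith(gatewayPath)
            ? clusterPath + e.substring(gatewayPath.length())
            : e)
        .collect(Collectors.toList());
  }
}
```

So a gateway-local entry like {{/disk1/hadoop/lib/*}} would be rewritten to
{{/disk2/hadoop/lib/*}} in the container's command line, while unrelated
entries pass through untouched.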
I'll send a PR for #2 shortly; I hope that clears up what I mean by the above.
> Support heterogeneous cluster nodes on YARN
> -------------------------------------------
>
> Key: SPARK-8302
> URL: https://issues.apache.org/jira/browse/SPARK-8302
> Project: Spark
> Issue Type: New Feature
> Components: YARN
> Affects Versions: 1.5.0
> Reporter: Marcelo Vanzin
>
> Some of our customers install Hadoop on different paths across the cluster.
> When running a Spark app, this leads to a few complications because of how we
> try to reuse the rest of Hadoop.
> Since all configuration for a Spark-on-YARN application is local, the code
> does not have enough information about how to run things on the rest of the
> cluster in such cases.
> To illustrate: let's say that a node's configuration says that
> {{SPARK_DIST_CLASSPATH=/disk1/hadoop/lib/*}}. If I launch a Spark app from
> that machine, but there's a machine on the cluster where Hadoop is actually
> installed in {{/disk2/hadoop/lib}}, then any container launched on that node
> will fail.
> The problem does not exist (or is much less pronounced) on standalone and
> Mesos, since both require a local Spark installation and configuration.
> It would be nice if we could easily support this use case on YARN.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)