[ https://issues.apache.org/jira/browse/SPARK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581305#comment-14581305 ]
Marcelo Vanzin commented on SPARK-8302:
---------------------------------------

So I spent some time on this and tried two different approaches.

* Approach #1: make configuration more dynamic. This means that you could set up your local Spark configuration to reference variables that would be replaced when actually launching processes. So, to use the example above, you'd set {{SPARK_DIST_CLASSPATH}} to something like {{${env.HADOOP_INSTALL_DIR}/lib/*}}, and that configuration would be expanded when launching processes. This option turned out to require far too many code changes: the launcher library, the standalone backend, the Mesos backend, and finally YARN would all need to know which configs could contain variables and how to replace them with which values, which ended up touching a lot of code. Since the feature is not really useful for anything but YARN, I decided to try a different approach.

* Approach #2: perform targeted path replacement in the YARN client code. This was my attempt to restrict changes to the minimum needed to cover the YARN case. I ended up with some simple code that allows paths to be replaced with an alternative in any variables that affect a remote process's command line / environment (such as classpaths and native library paths).

I'll send a PR for #2 shortly; I hope that clears up what I mean by the above.

> Support heterogeneous cluster nodes on YARN
> -------------------------------------------
>
>                 Key: SPARK-8302
>                 URL: https://issues.apache.org/jira/browse/SPARK-8302
>             Project: Spark
>          Issue Type: New Feature
>          Components: YARN
>    Affects Versions: 1.5.0
>            Reporter: Marcelo Vanzin
>
> Some of our customers install Hadoop on different paths across the cluster.
> When running a Spark app, this leads to a few complications because of how we
> try to reuse the rest of Hadoop.
> Since all configuration for a Spark-on-YARN application is local, the code
> does not have enough information about how to run things on the rest of the
> cluster in such cases.
> To illustrate: let's say that a node's configuration says that
> {{SPARK_DIST_CLASSPATH=/disk1/hadoop/lib/*}}. If I launch a Spark app from
> that machine, but there's a machine on the cluster where Hadoop is actually
> installed in {{/disk2/hadoop/lib}}, then any container launched on that node
> will fail.
> The problem does not exist (or is much less pronounced) on standalone and
> Mesos, since they require a local Spark installation and configuration.
> It would be nice if we could easily support this use case on YARN.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
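To make the mechanism in approach #2 concrete, here is a minimal sketch of what "targeted path replacement" could look like: rewriting a launcher-local path prefix to its remote equivalent in each entry of a classpath-style variable before it goes into the container's environment. The class name {{PathReplacer}} and the {{gatewayPath}}/{{replacementPath}} parameter names are illustrative assumptions, not the actual code from the Spark patch.

```java
import java.io.File;

public class PathReplacer {
    // Sketch: split a classpath-like value on the platform path separator,
    // rewrite any entry that starts with the launcher-local prefix
    // (gatewayPath) to use the remote prefix (replacementPath), and
    // reassemble. Names here are hypothetical, not Spark's config keys.
    static String substitute(String value, String gatewayPath, String replacementPath) {
        String[] entries = value.split(File.pathSeparator);
        StringBuilder out = new StringBuilder();
        for (String entry : entries) {
            if (out.length() > 0) {
                out.append(File.pathSeparator);
            }
            if (entry.startsWith(gatewayPath)) {
                out.append(replacementPath).append(entry.substring(gatewayPath.length()));
            } else {
                out.append(entry);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The launcher node's config puts Hadoop under /disk1/hadoop, but the
        // target node has it under /disk2/hadoop (the example from the issue).
        String local = "/disk1/hadoop/lib/*" + File.pathSeparator + "/opt/spark/jars/*";
        System.out.println(substitute(local, "/disk1/hadoop", "/disk2/hadoop"));
        // On Linux this prints: /disk2/hadoop/lib/*:/opt/spark/jars/*
    }
}
```

Entries that don't start with the gateway prefix pass through unchanged, which is what keeps the change "targeted": only the variables the YARN client already rewrites for the container need to go through this substitution.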