[
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199696#comment-15199696
]
Jonathan Kelly commented on MAHOUT-1762:
----------------------------------------
Why was using spark-submit voted down? (And where? On a JIRA or on the mailing
list?) Was it only voted down for now (e.g., due to a time constraint), or is
there no plan to ever switch?
I think spark-submit is Spark's recommended way of launching applications, even
for something like Mahout on Spark. Zeppelin and spark-jobserver used to do
something similar to what Mahout on Spark is doing now but have long since
switched to using spark-submit. I'm not too familiar with Hive on Spark, but a
quick glance at the source suggests that it also uses spark-submit.
In short, I'd really suggest using spark-submit for Mahout as well, if only to
match what most other apps are doing and to follow best practices.
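For concreteness, here's a rough sketch of what the launch could look like if
bin/mahout delegated to spark-submit; MAHOUT_SHELL_JAR is a placeholder for
illustration, not Mahout's actual build layout:
{code}
# Hypothetical launcher: let spark-submit read $SPARK_HOME/conf/spark-defaults.conf
# and assemble the JVM command line, instead of invoking java directly.
# MAHOUT_SHELL_JAR is a placeholder for wherever the shell classes actually live.
exec "$SPARK_HOME"/bin/spark-submit \
  --class org.apache.mahout.sparkbindings.shell.Main \
  "$MAHOUT_SHELL_JAR" \
  "$@"
{code}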
> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> -------------------------------------------------------
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
> Issue Type: Improvement
> Components: spark
> Reporter: Sergey Tryuber
> Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
> is meant to hold the global configuration for a Spark cluster. For example, in
> our HDP 2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
> {noformat}
> and there are many other useful settings. The expectation is that when a user
> starts the Spark shell, everything just works. Unfortunately, this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration,
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
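> For illustration, the manual duplication this forces looks roughly like the
> following (assuming the properties can be passed as spark.* JVM system
> properties, which SparkConf picks up by default):
> {code}
> # Hypothetical workaround: replicate spark-defaults.conf by hand as JVM system
> # properties, since the defaults file itself is never read.
> export MAHOUT_OPTS="-Dspark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041"
> {code}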
> This happens because
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
> is executed directly by the [initialization
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH"
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is invoked indirectly through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell]
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
> contains an additional initialization layer that loads the properties file
> (see the SparkSubmitArguments#mergeDefaultSparkProperties method).
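> To illustrate what that layer buys, a typical spark-submit run merges the
> defaults file with explicit flags (the class name and jar path below are
> placeholders):
> {code}
> # spark-submit reads $SPARK_HOME/conf/spark-defaults.conf (or the file passed
> # via --properties-file) and merges it with command-line settings; explicit
> # --conf values take precedence over the defaults file.
> "$SPARK_HOME"/bin/spark-submit \
>   --conf spark.executor.memory=4g \
>   --class com.example.MyApp \
>   /path/to/my-app.jar
> {code}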
> So there are two possible solutions (a sketch of the second follows below):
> * use proper Spark-like initialization logic
> * use a thin wrapper script, as H2O Sparkling Water does with
> [sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell]
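> A thin wrapper of the second kind might look something like this (modeled
> loosely on sparkling-shell, which wraps the stock spark-shell; the jar name is
> a placeholder, and whether Mahout's custom REPL could be driven this way is an
> open question):
> {code}
> # Hypothetical wrapper: reuse Spark's own launcher so spark-defaults.conf is
> # honored, and only add the Mahout jars on top.
> exec "$SPARK_HOME"/bin/spark-shell \
>   --jars "$MAHOUT_HOME"/mahout-spark-shell.jar \
>   "$@"
> {code}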