Re: Spark on YARN question
I've put my Spark JAR into HDFS and set the SPARK_JAR variable to point to the HDFS location of the jar. I'm not using any specialized configuration files (like spark-env.sh); rather, I set things either by environment variable per node, by passing application arguments to the job, or by making a ZooKeeper connection from my job to seed properties. From there, I can construct a SparkConf as necessary.

mn

On Sep 2, 2014, at 9:06 AM, Greg Hill <greg.h...@rackspace.com> wrote:

I'm working on setting up Spark on YARN using the HDP technical preview - http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs. It seems like everything is working. Unless I'm misunderstanding, it seems like there isn't any configuration required on the YARN slave nodes at all, apart from telling YARN where to find the Spark JAR files. Do the YARN processes even pick up local Spark configuration files on the slave nodes, or is that all just pulled in on the client and passed along to YARN?

Greg
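A minimal sketch of the setup Matt describes, assuming a hypothetical NameNode address (namenode:8020), HDFS directory, and assembly jar name; adjust all three for your cluster:

    $ # Upload the Spark assembly to HDFS once, so every job can reference it
    $ hdfs dfs -mkdir -p /user/spark/share/lib
    $ hdfs dfs -put spark-assembly-1.0.1-hadoop2.4.0.jar /user/spark/share/lib/
    $ # Point SPARK_JAR at the HDFS copy so YARN localizes it from HDFS
    $ export SPARK_JAR=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.0.1-hadoop2.4.0.jar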
Re: Spark on YARN question
Hi Greg,

You should not even need to manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on YARN will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on through the --jars argument if you are using spark-submit / spark-shell / pyspark.

As for environment variables, you can set SPARK_YARN_USER_ENV on the driver node (where your application is submitted) to specify environment variables to be observed by your executors. If you are using the spark-submit / spark-shell / pyspark scripts, then you can set Spark properties in the conf/spark-defaults.conf properties file, and these will be propagated to the executors. In other words, configurations on the slave nodes don't do anything.

For example:

    $ vim conf/spark-defaults.conf   # set a few properties
    $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
    $ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew
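For concreteness, a sketch of what conf/spark-defaults.conf on the submitting machine might contain; the property values here are illustrative, not recommendations:

    $ cat conf/spark-defaults.conf
    spark.master           yarn
    spark.executor.memory  2g
    spark.executor.cores   2
    spark.yarn.queue       default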
Re: Spark on YARN question
Thanks. That sounds like how I was thinking it worked. I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably just whichever node ends up spawning the application master that needs it, but it wasn't passed along from spark-submit.

Greg
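For reference, a yarn-cluster submission of the kind Greg describes might look like the following; the application jar path and main class are hypothetical:

    $ bin/spark-submit \
        --master yarn-cluster \
        --class com.example.MyApp \
        /local/path/to/my-app.jar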
Re: Spark on YARN question
Hello friends:

I have a follow-up to Andrew's well-articulated answer below (thank you for that).

(1) I've seen both of these invocations in various places:

(a) '--master yarn'
(b) '--master yarn-client'

the latter of which doesn't appear in 'pyspark|spark-submit|spark-shell --help' output. Is case (a) meant for cluster-mode apps (where the driver runs out on a YARN ApplicationMaster), and case (b) for client-mode apps needing client interaction locally? Also (related), is case (b) simply shorthand for the following invocation syntax? '--master yarn --deploy-mode client'

(2) Seeking clarification on the first sentence below...

Note: To avoid a copy of the Assembly JAR every time I launch a job, I place it (the latest version) at a specific (but otherwise arbitrary) location on HDFS, and then set SPARK_JAR, like so (where you can thankfully use wild-cards):

    export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar

But my question here is, when specifying additional JARs like this '--jars /path/to/jar1,/path/to/jar2,...' to pyspark|spark-submit|spark-shell commands, are those JARs expected to *already* be at those path locations on both the submitter server and the YARN worker servers? In other words, the '--jars' option won't cause the command to look for them locally at those path locations and then ship them to the same path locations remotely? They need to be there already, both locally and remotely. Correct?

Thank you. :)

didata
Re: Spark on YARN question
Hi Didata,

(1) Correct. The default deploy mode is `client`, so both masters `yarn` and `yarn-client` run Spark in client mode. If you explicitly specify the master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly specify one deploy mode through the master (e.g. yarn-client) but set the deploy mode to the opposite (e.g. cluster), Spark will complain and throw an exception. :)

(2) The jars passed through the `--jars` option only need to be visible to the spark-submit program. Depending on the deploy mode, they will be propagated to the containers (i.e. the executors, and the driver in cluster mode) differently, so you don't need to copy them yourself, whether by rsync'ing or by uploading to HDFS.

Another thing: SPARK_JAR is technically deprecated (you should get a warning for using it). Instead, you can set spark.yarn.jar in your conf/spark-defaults.conf on the submitter node.

Let me know if you have more questions,
-Andrew
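To make point (1) concrete, these two invocations should be equivalent, since client is the default deploy mode:

    $ bin/spark-shell --master yarn-client
    $ bin/spark-shell --master yarn --deploy-mode client

    $ # Conflicting combinations are rejected with an exception, per the above:
    $ # bin/spark-shell --master yarn-client --deploy-mode cluster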
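And a sketch of the non-deprecated alternative Andrew mentions, setting spark.yarn.jar in conf/spark-defaults.conf on the submitter node; the HDFS path is hypothetical:

    $ echo 'spark.yarn.jar hdfs://namenode:8020/user/spark/share/lib/spark-assembly.jar' >> conf/spark-defaults.conf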