Thanks. That sounds like how I was thinking it worked. I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably just whichever node ends up spawning the application master that needs it, but it wasn't passed along from spark-submit.
Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 2, 2014 11:05 AM
To: Matt Narrell <matt.narr...@gmail.com>
Cc: Greg <greg.h...@rackspace.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on YARN question

Hi Greg,

You should not even need to manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on YARN will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on through the --jars argument if you are using spark-submit / spark-shell / pyspark.

As for environment variables, you can set SPARK_YARN_USER_ENV on the driver node (where your application is submitted) to specify environment variables to be observed by your executors. If you are using the spark-submit / spark-shell / pyspark scripts, then you can set Spark properties in the conf/spark-defaults.conf properties file, and these will be propagated to the executors. In other words, configurations on the slave nodes don't do anything.

For example,

$ vim conf/spark-defaults.conf  # set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew
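For the yarn-cluster case Greg describes, the same flow can be sketched with spark-submit instead of spark-shell. This is a minimal sketch, not a tested recipe: SPARK_YARN_USER_ENV, spark-defaults.conf, --master, and --jars are the real knobs mentioned above, but the property value, class name (com.example.MyApp), and jar paths are hypothetical placeholders.

```shell
# Set Spark properties once on the submitting (driver) node;
# spark-submit propagates them to the executors, so nothing
# needs to be configured on the slave nodes themselves.
cat >> conf/spark-defaults.conf <<'EOF'
spark.executor.memory  2g
EOF

# Environment variables for the executors go through SPARK_YARN_USER_ENV.
export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2

# Submit in yarn-cluster mode; --jars ships the extra jars into the
# YARN containers (hypothetical class and jar names below).
bin/spark-submit --master yarn-cluster \
  --jars /local/path/to/my/jar1,/another/jar2 \
  --class com.example.MyApp myapp.jar
```

If the application master node still can't find the jars, as in Greg's report, checking the YARN container logs for the staged files is a reasonable first step.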