Thanks. That matches how I thought it worked. I did have to install the JARs on
the slave nodes for yarn-cluster mode to work, FWIW. It's probably just
whichever node ends up spawning the application master that needs them, but
they weren't passed along from spark-submit.
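(For reference, the sort of yarn-cluster submission I mean looks like the
following; the class name and jar paths below are just placeholders:)

$ bin/spark-submit --master yarn-cluster \
    --class com.example.MyApp \
    --jars /local/path/to/dep1.jar,/local/path/to/dep2.jar \
    /local/path/to/my-app.jar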

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 2, 2014 11:05 AM
To: Matt Narrell <matt.narr...@gmail.com>
Cc: Greg <greg.h...@rackspace.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on YARN question

Hi Greg,

You should not even need to manually install Spark on each of the worker nodes
or put it into HDFS yourself. Spark on YARN ships all of the necessary jars
(i.e. the assembly plus any additional jars) to each of the containers for you.
You can specify additional jars that your application depends on through the
--jars argument if you are using spark-submit / spark-shell / pyspark.

As for environment variables, you can set SPARK_YARN_USER_ENV on the driver
node (where your application is submitted) to specify environment variables to
be observed by your executors. If you are using the spark-submit / spark-shell /
pyspark scripts, you can also set Spark properties in the
conf/spark-defaults.conf properties file, and these will be propagated to the
executors. In other words, configurations on the slave nodes don't do anything.

For example,
$ vim conf/spark-defaults.conf   # set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2
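The spark-defaults.conf file itself is just whitespace-separated property names
and values, for instance (the values here are only illustrative):

spark.executor.memory      2g
spark.executor.instances   4
spark.yarn.queue           default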

Best,
-Andrew
