Re: Spark on YARN question

2014-09-02 Thread Matt Narrell
I’ve put my Spark JAR into HDFS, and specify the SPARK_JAR variable to point to 
the HDFS location of the jar.  I’m not using any specialized configuration 
files (like spark-env.sh), but rather setting things either by environment 
variable per node, passing application arguments to the job, or making a 
Zookeeper connection from my job to seed properties.  From there, I can 
construct a SparkConf as necessary.
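
For example, a minimal sketch of that setup (the local jar name, HDFS path, and
namenode address below are placeholders, not the exact ones I use):

$ hadoop fs -put ./lib/spark-assembly.jar /path/to/spark-assembly.jar
$ export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly.jar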

mn

On Sep 2, 2014, at 9:06 AM, Greg Hill greg.h...@rackspace.com wrote:

 I'm working on setting up Spark on YARN using the HDP technical preview - 
 http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/
 
 I have installed the Spark JARs on all the slave nodes and configured YARN to 
 find the JARs.  It seems like everything is working.
 
 Unless I'm misunderstanding, it seems like there isn't any configuration 
 required on the YARN slave nodes at all, apart from telling YARN where to 
 find the Spark JAR files.  Do the YARN processes even pick up local Spark 
 configuration files on the slave nodes, or is that all just pulled in on the 
 client and passed along to YARN?
 
 Greg



Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Greg,

You should not even need to manually install Spark on each of the worker
nodes or put it into HDFS yourself. Spark on YARN will ship all necessary
jars (i.e. the assembly plus any additional jars) to each of the containers
for you. You can specify additional jars that your application depends on
through the --jars argument if you are using spark-submit / spark-shell /
pyspark. As for environment variables, you can set SPARK_YARN_USER_ENV
on the driver node (where your application is submitted) to specify
environment variables to be observed by your executors. If you are using
the spark-submit / spark-shell / pyspark scripts, you can also set Spark
properties in the conf/spark-defaults.conf properties file, and these will
be propagated to the executors. In other words, Spark configuration on the
slave nodes has no effect.

For example,
$ vim conf/spark-defaults.conf   # set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2
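
where conf/spark-defaults.conf might contain something like this (the property
values are only illustrative):

spark.executor.memory  2g
spark.serializer       org.apache.spark.serializer.KryoSerializer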

Best,
-Andrew


Re: Spark on YARN question

2014-09-02 Thread Greg Hill
Thanks.  That sounds like how I was thinking it worked.  I did have to install 
the JARs on the slave nodes for yarn-cluster mode to work, FWIW.  It's probably 
just whichever node ends up spawning the application master that needs it, but 
it wasn't passed along from spark-submit.

Greg



Re: Spark on YARN question

2014-09-02 Thread Dimension Data, LLC.

Hello friends:

I have a follow-up to Andrew's well articulated answer below (thank you 
for that).


(1) I've seen both of these invocations in various places:

  (a) '--master yarn'
  (b) '--master yarn-client'

the latter of which doesn't appear in 'pyspark|spark-submit|spark-shell --help'
output.

Is case (a) meant for cluster-mode apps (where the driver is out on a YARN
ApplicationMaster), and case (b) for client-mode apps needing client
interaction locally?

Also (related), is case (b) simply shorthand for the following invocation
syntax?

   '--master yarn --deploy-mode client'

(2) Seeking clarification on the first sentence below...

Note: To avoid a copy of the Assembly JAR every time I launch a job, I place
it (the latest version) at a specific (but otherwise arbitrary) location on
HDFS, and then set SPARK_JAR, like so (where you can thankfully use
wild-cards):

   export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar

But my question here is, when specifying additional JARs like this
'--jars /path/to/jar1,/path/to/jar2,...' to pyspark|spark-submit|spark-shell
commands, are those JARs expected to *already* be at those path locations on
both the _submitter_ server as well as on the YARN _worker_ servers?

In other words, the '--jars' option won't cause the command to look for them
locally at those path locations and then ship and place them at the same
path locations remotely? They need to be there already, both locally and
remotely. Correct?

Thank you. :)
didata




Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Didata,

(1) Correct. The default deploy mode is `client`, so both masters `yarn`
and `yarn-client` run Spark in client mode. If you explicitly specify
master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly
specify one deploy mode through the master (e.g. yarn-client) but set
deploy mode to the opposite (e.g. cluster), Spark will complain and throw
an exception. :)
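
To make (1) concrete, the first two invocations below are equivalent, and the
third runs in cluster mode (the application class and jar names are just
placeholders):

$ bin/spark-submit --master yarn-client --class com.example.MyApp my-app.jar
$ bin/spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar   # same as above
$ bin/spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar                # cluster mode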

(2) The jars passed through the `--jars` option only need to be visible to
the spark-submit program. Depending on the deploy mode, they will be
propagated to the containers (i.e. the executors, and the driver in cluster
mode) differently so you don't need to manually copy them yourself, either
through rsync'ing or uploading to HDFS. Another thing is that SPARK_JAR
is technically deprecated (you should get a warning for using it). Instead,
you can set spark.yarn.jar in your conf/spark-defaults.conf on the
submitter node.
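
For example, you could add a single line like this to conf/spark-defaults.conf
on the submitter node (the HDFS path and port are placeholders):

spark.yarn.jar  hdfs://namenode:8020/path/to/spark-assembly.jar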

Let me know if you have more questions,
-Andrew

