I thought I had this all figured out, but I'm getting some weird errors now 
that I'm attempting to deploy this on production-size servers.  It's 
complaining that I'm not allocating enough memory relative to the memoryOverhead values.
 I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set 
spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but 
that makes no sense to me since that memory is just supposed to be what YARN 
needs on top of what you're allocating for Spark.  My understanding was that 
the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory 
overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.h...@rackspace.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of executors (workers) across the 
cluster, not the number per node. The equivalent "spark.executor.instances" is 
just another way to set the same thing in your spark-defaults.conf. Maybe this 
should be documented. :)
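
For example, these should all be equivalent ways of asking for 10 executors on 
YARN (the value is just illustrative):

    # in spark-env.sh
    export SPARK_EXECUTOR_INSTANCES=10

    # or in spark-defaults.conf
    spark.executor.instances    10

    # or as a spark-submit flag
    --num-executors 10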

"spark.yarn.executor.memoryOverhead" is just an additional margin added to 
"spark.executor.memory" for the container. In addition to the executor's 
memory, the container in which the executor is launched needs some extra memory 
for system processes, and this is what this "overhead" (somewhat of a misnomer) 
is for. The same goes for the driver equivalent.
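
As a rough illustration (the numbers here are made up, not defaults): with

    spark.executor.memory                4g
    spark.yarn.executor.memoryOverhead   512

each executor's YARN container request works out to roughly 4096 MB + 512 MB = 
4608 MB, since the overhead value is specified in MB.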

"spark.driver.memory" behaves differently depending on which version of Spark 
you are using. If you are using Spark 1.1+ (this was released very recently), 
you can directly set "spark.driver.memory" and this will take effect. 
Otherwise, setting this doesn't actually do anything for client deploy mode, 
so you have two alternatives: (1) set the equivalent environment variable 
SPARK_DRIVER_MEMORY in spark-env.sh, or (2) if you are using Spark submit (or 
bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the 
"--driver-memory" command line argument.

If you want your PySpark application (driver) to pick up extra class path 
entries, you can pass the "--driver-class-path" option to Spark submit. If you 
are using Spark 
1.1+, you may set "spark.driver.extraClassPath" in your spark-defaults.conf. 
There is also an environment variable you could set (SPARK_CLASSPATH), though 
this is now deprecated.
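
For instance (the jar path below is just a placeholder):

    bin/pyspark --driver-class-path /path/to/extra.jar

or, on 1.1+, in spark-defaults.conf:

    spark.driver.extraClassPath    /path/to/extra.jar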

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the 
number of workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, 
but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and 
spark.executor.memory?  Same question for the 'driver' variant, but I assume 
it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use 
the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get the 
interactive pyspark shell to pick up the YARN class path?  The ones that work 
for spark-shell 
and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg
