Re: clarification for some spark on yarn configuration options
Thanks for looking into it. I'm trying to avoid making the user pass in any parameters by configuring the right values for the cluster size by default, hence my reliance on the configuration. I'd rather just use spark-defaults.conf than the environment variables, and looking at the code you modified, I don't see anywhere that it picks up spark.driver.memory either. Is that a separate bug?

Greg

From: Andrew Or <and...@databricks.com>
Date: Monday, September 22, 2014 8:11 PM
To: Nishkam Ravi <nr...@cloudera.com>
Cc: Greg <greg.h...@rackspace.com>, user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

From browsing the code quickly, I believe SPARK_DRIVER_MEMORY is not actually picked up in cluster mode. This is a bug, and I have opened a PR to fix it: https://github.com/apache/spark/pull/2500. For now, please use --driver-memory instead, which should work for both client and cluster mode.

Thanks for pointing this out,
-Andrew
Re: clarification for some spark on yarn configuration options
Yes, good find. I have filed a JIRA here: https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it shortly. Both of these fixes will be available in 1.1.1. Until both of them are merged in, it appears that the only way to do it is through --driver-memory.

-Andrew

2014-09-23 7:23 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

Thanks for looking into it. I'm trying to avoid making the user pass in any parameters by configuring the right values for the cluster size by default, hence my reliance on the configuration. I'd rather just use spark-defaults.conf than the environment variables, and looking at the code you modified, I don't see anywhere that it picks up spark.driver.memory either. Is that a separate bug?

Greg
Re: clarification for some spark on yarn configuration options
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory for the memoryOverhead values. I tracked it down to this code: https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.h...@rackspace.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added on top of spark.executor.memory for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead (somewhat of a misnomer) is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (released very recently), you can directly set spark.driver.memory and it will take effect. Otherwise, setting it doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the equivalent environment variable SPARK_DRIVER_MEMORY in spark-env.sh, or (2) if you are using Spark submit (or bin/spark-shell or bin/pyspark, which go through bin/spark-submit), pass the --driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass --driver-class-path to Spark submit. If you are using Spark 1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew
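Concretely, the settings discussed in this thread would sit together in spark-defaults.conf along these lines. The values below are placeholders for illustration, not recommendations; size them for your cluster:

```properties
# spark-defaults.conf -- illustrative values only
spark.executor.instances              10
spark.executor.memory                 4g
# Extra headroom YARN reserves for the container, on top of executor memory
spark.yarn.executor.memoryOverhead    384
# Honored directly in Spark 1.1+; pre-1.1, use SPARK_DRIVER_MEMORY or --driver-memory
spark.driver.memory                   2g
spark.yarn.driver.memoryOverhead      384
```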
Re: clarification for some spark on yarn configuration options
Greg, if you look carefully, the code is enforcing that the memoryOverhead be lower (and not higher) than spark.driver.memory.

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <greg.h...@rackspace.com> wrote:

I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory for the memoryOverhead values. I tracked it down to this code: https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg
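The direction of the comparison is easy to read backwards. A minimal sketch in Python (the real check is Scala in ClientBase.scala; the function name and message here are illustrative, not Spark's actual code) of the validation being discussed: the requested memory must exceed the overhead, not the other way around.

```python
DEFAULT_MEMORY_OVERHEAD_MB = 384  # illustrative overhead default

def validate_container_memory(memory_mb, overhead_mb=DEFAULT_MEMORY_OVERHEAD_MB):
    """Reject requests where the main allocation doesn't exceed the overhead.

    The check fails when the memory setting is too LOW relative to the
    overhead, i.e. it enforces overhead < memory, not overhead > memory.
    """
    if memory_mb <= overhead_mb:
        raise ValueError(
            "Memory size must be greater than the memoryOverhead "
            f"({overhead_mb}m); got {memory_mb}m"
        )
    # YARN then sizes the container as the sum of the two.
    return memory_mb + overhead_mb
```

This also shows how Greg hit the error: once a large memoryOverhead was configured but spark.driver.memory silently fell back to the 512m default, 512 was no longer greater than the overhead.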
Re: clarification for some spark on yarn configuration options
Gah, ignore me again. I was reading the logic backwards. For some reason it isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the default of 512m. Probably an environmental issue.

Greg

From: Greg <greg.h...@rackspace.com>
Date: Monday, September 22, 2014 3:26 PM
To: Andrew Or <and...@databricks.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory for the memoryOverhead values. I tracked it down to this code: https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg
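The failure mode above (an env var set but the 512m default winning) comes down to which source of the value is consulted. A rough model of the sources discussed in this thread, sketched in Python; the exact precedence order is an assumption for illustration, not Spark's actual resolution code:

```python
import os

DEFAULT_DRIVER_MEMORY = "512m"  # the default Greg kept falling back to

def resolve_driver_memory(cli_flag=None, conf=None, env=None):
    """Model of where driver memory can come from, first match wins:
    --driver-memory flag, then spark.driver.memory from
    spark-defaults.conf (honored in 1.1+), then SPARK_DRIVER_MEMORY,
    then the 512m default."""
    conf = conf or {}
    env = env if env is not None else os.environ
    return (
        cli_flag
        or conf.get("spark.driver.memory")
        or env.get("SPARK_DRIVER_MEMORY")
        or DEFAULT_DRIVER_MEMORY
    )
```

Under this model, if the environment variable never reaches the process that launches the driver (as in yarn-cluster mode on 1.0.x), resolution falls straight through to 512m.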
Re: clarification for some spark on yarn configuration options
Maybe try --driver-memory if you are using spark-submit?

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill <greg.h...@rackspace.com> wrote:

Ah, I see. It turns out that my problem is that the comparison is ignoring SPARK_DRIVER_MEMORY and comparing against the default of 512m. Is that a bug that has since been fixed? I'm on 1.0.1 and using 'yarn-cluster' as the master. 'yarn-client' seems to pick up the values and works fine.

Greg
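The --driver-memory suggestion can be sketched as a spark-submit invocation like the one below. Paths, sizes, and the job file are hypothetical placeholders; the command is assembled and echoed here rather than executed:

```shell
# Hypothetical paths and sizes; adjust for your cluster.
SPARK_HOME=/opt/spark
DRIVER_MEM=2g

# --driver-memory works for both yarn-client and yarn-cluster mode,
# which makes it the reliable option on 1.0.x.
CMD="$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --driver-memory $DRIVER_MEM \
  --executor-memory 4g \
  --num-executors 10 \
  my_job.py"

echo "$CMD"
```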
Re: clarification for some spark on yarn configuration options
Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added on top of spark.executor.memory for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead (somewhat of a misnomer) is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (released very recently), you can directly set spark.driver.memory and it will take effect. Otherwise, setting it doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the equivalent environment variable SPARK_DRIVER_MEMORY in spark-env.sh, or (2) if you are using Spark submit (or bin/spark-shell or bin/pyspark, which go through bin/spark-submit), pass the --driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass --driver-class-path to Spark submit. If you are using Spark 1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew

2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster, or the workers per slave node? Is spark.executor.instances an actual config option? I found it in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory? Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented, or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the YARN class path? The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.
Greg
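For the pre-1.1 environment-variable route Andrew describes, the settings would go in spark-env.sh roughly as follows. Values are illustrative placeholders, not recommendations:

```shell
# spark-env.sh -- illustrative values only; size these for your cluster.
# Pre-1.1, SPARK_DRIVER_MEMORY is the spark-env.sh route for driver memory
# in client mode (--driver-memory being the spark-submit route).
SPARK_EXECUTOR_INSTANCES=10
SPARK_DRIVER_MEMORY=2g
# SPARK_CLASSPATH can add class path entries but is deprecated; prefer
# --driver-class-path, or spark.driver.extraClassPath on 1.1+.
```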