Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Greg Hill
Thanks for looking into it.  I'm trying to avoid making the user pass in any 
parameters by configuring it to use the right values for the cluster size by 
default, hence my reliance on the configuration.  I'd rather just use 
spark-defaults.conf than the environment variables, and looking at the code you 
modified, I don't see any place it's picking up spark.driver.memory either.  Is 
that a separate bug?

Greg


From: Andrew Or and...@databricks.com
Date: Monday, September 22, 2014 8:11 PM
To: Nishkam Ravi nr...@cloudera.com
Cc: Greg greg.h...@rackspace.com, user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually 
picked up in cluster mode. This is a bug and I have opened a PR to fix it: 
https://github.com/apache/spark/pull/2500.
For now, please use --driver-memory instead, which should work for both client 
and cluster mode.

Thanks for pointing this out,
-Andrew


Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Andrew Or
Yes... good find. I have filed a JIRA here:
https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it
shortly. Both of these fixes will be available in 1.1.1. Until both of
these are merged in, it appears that the only way you can do it now is
through --driver-memory.
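
For reference, a minimal yarn-cluster invocation using that flag looks roughly
like this (the memory size, class, and jar path are placeholders):

  # memory size, class, and jar path are placeholders
  ./bin/spark-submit \
    --master yarn-cluster \
    --driver-memory 2g \
    --class com.example.MyApp \
    /path/to/my-app.jar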

-Andrew

2014-09-23 7:23 GMT-07:00 Greg Hill greg.h...@rackspace.com:

  Thanks for looking into it.  I'm trying to avoid making the user pass in
 any parameters by configuring it to use the right values for the cluster
 size by default, hence my reliance on the configuration.  I'd rather just
 use spark-defaults.conf than the environment variables, and looking at the
 code you modified, I don't see any place it's picking up
 spark.driver.memory either.  Is that a separate bug?

  Greg


   From: Andrew Or and...@databricks.com
 Date: Monday, September 22, 2014 8:11 PM
 To: Nishkam Ravi nr...@cloudera.com
 Cc: Greg greg.h...@rackspace.com, user@spark.apache.org 
 user@spark.apache.org

 Subject: Re: clarification for some spark on yarn configuration options

   Hi Greg,

  From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not
 actually picked up in cluster mode. This is a bug and I have opened a PR to
 fix it: https://github.com/apache/spark/pull/2500.
 For now, please use --driver-memory instead, which should work for both
 client and cluster mode.

  Thanks for pointing this out,
 -Andrew

 2014-09-22 14:04 GMT-07:00 Nishkam Ravi nr...@cloudera.com:

 Maybe try --driver-memory if you are using spark-submit?

  Thanks,
 Nishkam

 On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill greg.h...@rackspace.com
 wrote:

  Ah, I see.  It turns out that my problem is that that comparison is
 ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m.  Is that
 a bug that's since fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the
 master.  'yarn-client' seems to pick up the values and works fine.

  Greg

   From: Nishkam Ravi nr...@cloudera.com
 Date: Monday, September 22, 2014 3:30 PM
 To: Greg greg.h...@rackspace.com
 Cc: Andrew Or and...@databricks.com, user@spark.apache.org 
 user@spark.apache.org

 Subject: Re: clarification for some spark on yarn configuration options

   Greg, if you look carefully, the code is enforcing that the
 memoryOverhead be lower (and not higher) than spark.driver.memory.

  Thanks,
 Nishkam

 On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill greg.h...@rackspace.com
 wrote:

  I thought I had this all figured out, but I'm getting some weird
 errors now that I'm attempting to deploy this on production-size servers.
 It's complaining that I'm not allocating enough memory to the
 memoryOverhead values.  I tracked it down to this code:


 https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

  Unless I'm reading it wrong, those checks are enforcing that you set
 spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but
 that makes no sense to me since that memory is just supposed to be what
 YARN needs on top of what you're allocating for Spark.  My understanding
 was that the overhead values should be quite a bit lower (and by default
 they are).

  Also, why must the executor be allocated less memory than the
 driver's memory overhead value?

  What am I misunderstanding here?

  Greg

   From: Andrew Or and...@databricks.com
 Date: Tuesday, September 9, 2014 5:49 PM
 To: Greg greg.h...@rackspace.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: clarification for some spark on yarn configuration options

   Hi Greg,

  SPARK_EXECUTOR_INSTANCES is the total number of workers in the
 cluster. The equivalent spark.executor.instances is just another way to
 set the same thing in your spark-defaults.conf. Maybe this should be
 documented. :)

  spark.yarn.executor.memoryOverhead is just an additional margin
 added to spark.executor.memory for the container. In addition to the
 executor's memory, the container in which the executor is launched needs
 some extra memory for system processes, and this is what this overhead
 (somewhat of a misnomer) is for. The same goes for the driver equivalent.

  spark.driver.memory behaves differently depending on which version
 of Spark you are using. If you are using Spark 1.1+ (this was released very
 recently), you can directly set spark.driver.memory and this will take
 effect. Otherwise, setting this doesn't actually do anything for client
 deploy mode, and you have two alternatives: (1) set the environment
 variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are
 using Spark submit (or bin/spark-shell, or bin/pyspark, which go through
 bin/spark-submit), pass the --driver-memory command line argument.

  If you want your PySpark application (driver) to pick up extra class
 path, you can pass the --driver-class-path to Spark submit. If you are
 using Spark 1.1+, you may set spark.driver.extraClassPath

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
I thought I had this all figured out, but I'm getting some weird errors now 
that I'm attempting to deploy this on production-size servers.  It's 
complaining that I'm not allocating enough memory to the memoryOverhead values. 
 I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set 
spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but 
that makes no sense to me since that memory is just supposed to be what YARN 
needs on top of what you're allocating for Spark.  My understanding was that 
the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory 
overhead value?

What am I misunderstanding here?

Greg


Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Greg, if you look carefully, the code is enforcing that the memoryOverhead
be lower (and not higher) than spark.driver.memory.
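Put differently, something like the following (values are only illustrative) is
the intended shape, with the overhead a small slice on top of the driver memory:

  # spark-defaults.conf -- values are illustrative
  spark.driver.memory              2g
  spark.yarn.driver.memoryOverhead 384   # MB for the container, on top of the 2g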

Thanks,
Nishkam


Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
Gah, ignore me again.  I was reading the logic backwards.  For some reason it 
isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the 
default of 512m.  Probably an environmental issue.

Greg


Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Maybe try --driver-memory if you are using spark-submit?

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill greg.h...@rackspace.com wrote:

  Ah, I see.  It turns out that my problem is that that comparison is
 ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m.  Is that
 a bug that's since fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the
 master.  'yarn-client' seems to pick up the values and works fine.

  Greg


Re: clarification for some spark on yarn configuration options

2014-09-09 Thread Andrew Or
Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The
equivalent spark.executor.instances is just another way to set the same
thing in your spark-defaults.conf. Maybe this should be documented. :)
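
For example, either of the following (the count is arbitrary) asks for the same
number of executors:

  # spark-env.sh
  export SPARK_EXECUTOR_INSTANCES=10

  # or, equivalently, in spark-defaults.conf
  spark.executor.instances 10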

spark.yarn.executor.memoryOverhead is just an additional margin added to
spark.executor.memory for the container. In addition to the executor's
memory, the container in which the executor is launched needs some extra
memory for system processes, and this is what this overhead (somewhat of
a misnomer) is for. The same goes for the driver equivalent.
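
As a rough sketch with made-up sizes, the container YARN allocates is about the
sum of the two:

  # spark-defaults.conf -- sizes are made up
  spark.executor.memory              4g
  spark.yarn.executor.memoryOverhead 512   # extra MB for the container, on top of the 4g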

spark.driver.memory behaves differently depending on which version of
Spark you are using. If you are using Spark 1.1+ (this was released very
recently), you can directly set spark.driver.memory and this will take
effect. Otherwise, setting this doesn't actually do anything for client
deploy mode, and you have two alternatives: (1) set the environment
variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are
using Spark submit (or bin/spark-shell, or bin/pyspark, which go through
bin/spark-submit), pass the --driver-memory command line argument.
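
Concretely, any one of these works (2g is just an example):

  # spark-defaults.conf (takes effect on Spark 1.1+)
  spark.driver.memory 2g

  # spark-env.sh (pre-1.1, client deploy mode)
  export SPARK_DRIVER_MEMORY=2g

  # or pass --driver-memory 2g on the spark-submit / spark-shell / pyspark command line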

If you want your PySpark application (driver) to pick up extra class path,
you can pass the --driver-class-path to Spark submit. If you are using
Spark 1.1+, you may set spark.driver.extraClassPath in your
spark-defaults.conf. There is also an environment variable you could set
(SPARK_CLASSPATH), though this is now deprecated.
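
For example (the jar, class, and paths are placeholders):

  # on the command line
  ./bin/spark-submit --driver-class-path /path/to/extra.jar \
    --class com.example.MyApp /path/to/my-app.jar

  # or, on Spark 1.1+, in spark-defaults.conf
  spark.driver.extraClassPath /path/to/extra.jar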

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill greg.h...@rackspace.com:

  Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster
 or the workers per slave node?

  Is spark.executor.instances an actual config option?  I found that in a
 commit, but it's not in the docs.

  What is the difference between spark.yarn.executor.memoryOverhead and
 spark.executor.memory?  Same question for the 'driver' variant, but I assume
 it's the same answer.

  Is there a spark.driver.memory option that's undocumented or do you have
 to use the environment variable SPARK_DRIVER_MEMORY?

  What config option or environment variable do I need to set to get
 pyspark interactive to pick up the yarn class path?  The ones that work for
 spark-shell and spark-submit don't seem to work for pyspark.

  Thanks in advance.

  Greg