Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Thank you Nishkam,
I have read your code. So, for the sake of my understanding, it seems that
for each spark context there is one executor per node? Can anyone confirm
this?


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in Spark.
 Takes as input high-level parameters (like number of nodes, cores per node,
 memory per node, etc) and spits out default configuration, user advice and
 command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this over
 time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 "If each machine has 16 GB of RAM and 4 cores, for example, you might set
 spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Spark."

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves), each
 with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we
 want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set of
 executors across the cluster.  So for example if you have two shells open,
 they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have
 and how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely
 read the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 other memory settings available (daemon memory, worker memory etc). Perhaps
 a worked example could be added to the docs? I would be happy to provide
 some text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240







Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Great - thanks for the clarification Aaron. The offer stands for me to
write some documentation and an example that covers this without leaving
*any* room for ambiguity.




-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson ilike...@gmail.com wrote:

 Whoops, I was mistaken in my original post last year. By default, there is
 one executor per node per Spark Context, as you said.
 spark.executor.memory is the amount of memory that the application
 requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
 memory a Spark Worker is willing to allocate in executors.

 So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
 cluster, and spark.executor.memory to 4g, you would be able to run 2
 simultaneous Spark Contexts that each get 4g per node. Similarly, if
 spark.executor.memory were 8g, you could only run 1 Spark Context at a time
 on the cluster, but it would get all the cluster's memory.


 On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thank you Nishkam,
 I have read your code. So, for the sake of my understanding, it seems
 that for each spark context there is one executor per node? Can anyone
 confirm this?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in
 Spark. Takes as input high-level parameters (like number of nodes, cores
 per node, memory per node, etc) and spits out default configuration, user
 advice and command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this
 over time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 "If each machine has 16 GB of RAM and 4 cores, for example, you might
 set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
 Spark."

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves), each
 with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we
 want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set
 of executors across the cluster.  So for example if you have two shells
 open, they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons 
 (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have
 and how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely
 read the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors 
 on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to
 the *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 

Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
Whoops, I was mistaken in my original post last year. By default, there is
one executor per node per Spark Context, as you said.
spark.executor.memory is the amount of memory that the application
requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
memory a Spark Worker is willing to allocate in executors.

So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your cluster,
and spark.executor.memory to 4g, you would be able to run 2 simultaneous
Spark Contexts that each get 4g per node. Similarly, if spark.executor.memory
were 8g, you could only run 1 Spark Context at a time on the cluster, but
it would get all the cluster's memory.
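
In case it helps, here is a minimal sketch of where each knob lives, reusing
the 8g/4g numbers above (the master URL and app name are just placeholders,
not anything specific to your setup):

    # Worker side: set in conf/spark-env.sh on every worker node.
    #   export SPARK_WORKER_MEMORY=8g   # total memory this Worker may hand to executors

    # Application side: request 4g per executor, e.g. from PySpark.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master:7077")       # placeholder standalone master URL
            .setAppName("memory-example")           # placeholder app name
            .set("spark.executor.memory", "4g"))    # per-executor (i.e. per-node) request
    sc = SparkContext(conf=conf)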


On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson mar...@skimlinks.com
wrote:

 Thank you Nishkam,
 I have read your code. So, for the sake of my understanding, it seems that
 for each spark context there is one executor per node? Can anyone confirm
 this?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in Spark.
 Takes as input high-level parameters (like number of nodes, cores per node,
 memory per node, etc) and spits out default configuration, user advice and
 command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this over
 time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 "If each machine has 16 GB of RAM and 4 cores, for example, you might
 set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
 Spark."

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves), each
 with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we
 want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set
 of executors across the cluster.  So for example if you have two shells
 open, they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have
 and how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely
 read the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors 
 on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 other memory settings available (daemon memory, worker memory etc). 
 Perhaps
 a worked example could be added to the docs? I would be happy to provide
 some text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240








Re: Configuring Spark Memory

2014-07-24 Thread John Omernik
So this is good information for standalone mode, but how is memory distributed
within Mesos? There's coarse-grained mode, where the executor stays active, and
there's fine-grained mode, where it appears each task runs as its own process in
Mesos. How do memory allocations work in these two cases? Thanks!
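
To make the question concrete, this is the kind of toggle I have in mind (just
a sketch -- the master URL and values are made up, and I may well be holding it
wrong):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("mesos://mesos-master:5050")   # hypothetical Mesos master
            .setAppName("mesos-memory-question")      # hypothetical app name
            .set("spark.mesos.coarse", "true")        # coarse-grained; false (the default) = fine-grained
            .set("spark.executor.memory", "4g"))      # how is this honored in each mode?
    sc = SparkContext(conf=conf)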



On Thu, Jul 24, 2014 at 12:14 PM, Martin Goodson mar...@skimlinks.com
wrote:

 Great - thanks for the clarification Aaron. The offer stands for me to
 write some documentation and an example that covers this without leaving
 *any* room for ambiguity.




 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 Whoops, I was mistaken in my original post last year. By default, there
 is one executor per node per Spark Context, as you said.
 spark.executor.memory is the amount of memory that the application
 requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
 memory a Spark Worker is willing to allocate in executors.

 So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
 cluster, and spark.executor.memory to 4g, you would be able to run 2
 simultaneous Spark Contexts that each get 4g per node. Similarly, if
 spark.executor.memory were 8g, you could only run 1 Spark Context at a time
 on the cluster, but it would get all the cluster's memory.


 On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thank you Nishkam,
 I have read your code. So, for the sake of my understanding, it seems
 that for each spark context there is one executor per node? Can anyone
 confirm this?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com
 wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in
 Spark. Takes as input high-level parameters (like number of nodes, cores
 per node, memory per node, etc) and spits out default configuration, user
 advice and command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this
 over time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 "If each machine has 16 GB of RAM and 4 cores, for example, you might
 set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
 Spark."

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves),
 each with 60 GB. We would like to use 4 cores per machine. This is PySpark,
 so we want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set
 of executors across the cluster.  So for example if you have two shells
 open, they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons 
 (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have
 and how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
  wrote:

 We are having difficulties configuring Spark, partly because we
 still don't understand some key concepts. For instance, how many 
 executors
 are there per machine in standalone mode? This is after having
 closely read the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but
 is ambiguous about whether there are single executors or multiple 
 executors
 on each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 

Configuring Spark Memory

2014-07-23 Thread Martin Goodson
We are having difficulties configuring Spark, partly because we still don't
understand some key concepts. For instance, how many executors are there
per machine in standalone mode? This is after having closely read the
documentation several times:

http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/spark-standalone.html
http://spark.apache.org/docs/latest/tuning.html
http://spark.apache.org/docs/latest/cluster-overview.html

The cluster overview has some information here about executors but is
ambiguous about whether there are single executors or multiple executors on
each machine.

 This message from Aaron Davidson implies that the executor memory should
be set to total available memory on the machine divided by the number of
cores:
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

But other messages imply that the executor memory should be set to the
*total* available memory of each machine.
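
To spell out the two readings with Aaron's own numbers (this is just my
arithmetic, which may itself be the source of my confusion):

    # Aaron's example machine: 16 GB of RAM, 4 cores.
    ram_gb, cores = 16, 4

    per_core_reading = ram_gb / cores   # 4 GB per executor (his 2-3 GB once headroom is left)
    whole_machine_reading = ram_gb      # 16 GB: one executor takes the whole machine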

We would very much appreciate some clarity on this and the myriad of other
memory settings available (daemon memory, worker memory etc). Perhaps a
worked example could be added to the docs? I would be happy to provide some
text as soon as someone can enlighten me on the technicalities!

Thank you

-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
Hi Martin,

In standalone mode, each SparkContext you initialize gets its own set of
executors across the cluster.  So for example if you have two shells open,
they'll each get two JVMs on each worker machine in the cluster.

As far as the other docs, you can configure the total number of cores
requested for the SparkContext, the amount of memory for the executor JVM
on each machine, the amount of memory for the Master/Worker daemons (little
needed since work is done in executors), and several other settings.
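
Roughly, those knobs map onto settings like the following (illustrative values
only, mixing the per-application properties with the spark-env.sh variables):

    # Per-application, e.g. via SparkConf in PySpark:
    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.cores.max", "16")           # total cores requested for this SparkContext
            .set("spark.executor.memory", "4g"))    # heap for the executor JVM on each machine

    # Cluster side, in conf/spark-env.sh on each node:
    #   export SPARK_WORKER_MEMORY=48g   # what a Worker may allocate to executors
    #   export SPARK_DAEMON_MEMORY=1g    # memory for the Master/Worker daemons themselves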

Which of those are you interested in?  What spec hardware do you have and
how do you want to configure it?

Andrew


On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely read
 the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors on
 each machine.

  This message from Aaron Davidson implies that the executor memory should
 be set to total available memory on the machine divided by the number of
 cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of other
 memory settings available (daemon memory, worker memory etc). Perhaps a
 worked example could be added to the docs? I would be happy to provide some
 text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240



Re: Configuring Spark Memory

2014-07-23 Thread Sean Owen
On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:


 In standalone mode, each SparkContext you initialize gets its own set of
 executors across the cluster.  So for example if you have two shells open,
 they'll each get two JVMs on each worker machine in the cluster.


Dumb question offline -- do you mean they'll each get one JVM on each
worker? Or, if it's two, what drives the two for each?


Re: Configuring Spark Memory

2014-07-23 Thread Martin Goodson
Thanks Andrew,

So if there is only one SparkContext there is only one executor per
machine? This seems to contradict Aaron's message from the link above:

"If each machine has 16 GB of RAM and 4 cores, for example, you might set
spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Spark."

Am I reading this incorrectly?

Anyway, our configuration is 21 machines (one master and 20 slaves), each
with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we
want to leave, say, 16 GB on each machine for Python processes.
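
If it really is one executor per machine, then I suppose the arithmetic we have
in mind is roughly this (our own back-of-the-envelope numbers, so please
correct me if this is the wrong way to think about it):

    node_ram_gb = 60             # RAM per slave
    reserve_for_python_gb = 16   # headroom for the PySpark worker processes
    reserve_for_os_gb = 4        # our guess for the OS and the Worker daemon

    executor_gb = node_ram_gb - reserve_for_python_gb - reserve_for_os_gb
    print(executor_gb)           # 40 -> spark.executor.memory=40g and SPARK_WORKER_MEMORY=40g?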

Thanks again for the advice!



-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set of
 executors across the cluster.  So for example if you have two shells open,
 they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have and
 how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely read
 the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 other memory settings available (daemon memory, worker memory etc). Perhaps
 a worked example could be added to the docs? I would be happy to provide
 some text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240





Re: Configuring Spark Memory

2014-07-23 Thread Nishkam Ravi
See if this helps:

https://github.com/nishkamravi2/SparkAutoConfig/

It's a very simple tool for auto-configuring default parameters in Spark.
Takes as input high-level parameters (like number of nodes, cores per node,
memory per node, etc) and spits out default configuration, user advice and
command line. Compile (javac SparkConfigure.java) and run (java
SparkConfigure).

Also cc'ing dev in case others are interested in helping evolve this over
time (by refining the heuristics and adding more parameters).


On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 "If each machine has 16 GB of RAM and 4 cores, for example, you might set
 spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Spark."

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves), each
 with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we
 want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set of
 executors across the cluster.  So for example if you have two shells open,
 they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have and
 how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely read
 the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 other memory settings available (daemon memory, worker memory etc). Perhaps
 a worked example could be added to the docs? I would be happy to provide
 some text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240