Re: Tuning number of partitions per CPU

2015-02-17 Thread Sean Owen
More tasks means a little *more* total CPU time, not less, because of the
overhead of handling each task. However, more tasks can actually mean less
wall-clock time.

This is because tasks vary in how long they take. If you have one task
per core, the job takes as long as the slowest task, and at the end all
the other cores sit idle waiting for it.

Splitting up the tasks makes the distribution of time taken across cores
more even on average. That is, the variance of a single task's completion
time is higher than the variance of the sum of N task completion times,
where each of those tasks is 1/N the size.

Whether this is actually better depends on the per-task overhead and
variance in execution time. With high overhead and low variance, one
task per core is probably optimal.
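
To make that concrete, here is a toy sketch (plain Scala, nothing to do with
Spark's actual scheduler, and with made-up overhead and skew numbers) that
splits a fixed amount of work into more or fewer tasks, schedules them
greedily over 36 cores, and prints the resulting wall-clock time:

import scala.util.Random

// Toy model only, not Spark: split one unit of work into nTasks tasks, add a
// fixed per-task overhead plus some random skew, schedule the tasks greedily
// onto nCores cores (longest first), and report the makespan.
object PartitionSketch {
  def wallClock(nCores: Int, nTasks: Int, overhead: Double, skew: Double, rng: Random): Double = {
    val durations = Seq.fill(nTasks) {
      (1.0 / nTasks) * (1.0 + skew * math.abs(rng.nextGaussian())) + overhead
    }
    val cores = Array.fill(nCores)(0.0)
    durations.sortBy(d => -d).foreach { d =>
      val i = cores.indexOf(cores.min)   // hand the next task to the least-loaded core
      cores(i) += d
    }
    cores.max
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    for (perCore <- Seq(1, 2, 3, 4)) {
      val t = wallClock(nCores = 36, nTasks = 36 * perCore, overhead = 0.002, skew = 0.5, rng = rng)
      println(f"$perCore partition(s) per core -> wall clock ~ $t%.4f (total work = 1.0)")
    }
  }
}

Playing with the overhead and skew parameters shows how the answer flips: with
noticeable skew and small overhead, a few partitions per core come out ahead;
with large overhead and little skew, one partition per core wins.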

On Tue, Feb 17, 2015 at 3:38 PM, Igor Petrov igorpetrov...@gmail.com wrote:
 Hello,

 Thanks for your replies. The question is actually about the recommendation in
 the Spark docs: "Typically you want 2-4 partitions for each CPU in your cluster."

 Why is having several partitions per CPU better than one partition per CPU?
 How can one CPU handle several tasks faster than one task?

 Thank You


 On Fri, Feb 13, 2015 at 2:44 PM, Puneet Kumar Ojha
 puneet.ku...@pubmatic.com wrote:

 Use the configuration below if you are using version 1.2:

 SET spark.shuffle.consolidateFiles=true;
 SET spark.rdd.compress=true;
 SET spark.default.parallelism=1000;
 SET spark.deploy.defaultCores=54;

 Thanks
 Puneet.

 -Original Message-
 From: Sean Owen [mailto:so...@cloudera.com]
 Sent: Friday, February 13, 2015 4:46 PM
 To: Igor Petrov
 Cc: user@spark.apache.org
 Subject: Re: Tuning number of partitions per CPU

 18 cores or 36? It probably doesn't matter.
 For this case, where you have some per-partition overhead from setting up the
 DB connection, it may indeed not help to chop the data up more finely than
 your total parallelism, although that would imply quite a large overhead. Are
 you doing any other expensive initialization per partition in your code?
 You might check some other basic things, like whether you are bottlenecked on
 the DB (probably not) and whether there are task stragglers drawing out the
 completion time.

 On Fri, Feb 13, 2015 at 11:06 AM, Igor Petrov igorpetrov...@gmail.com
 wrote:
  Hello,
 
  In the Spark programming guide
  (http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
  recommendation: "Typically you want 2-4 partitions for each CPU in your
  cluster."
 
  We have a Spark Master and two Spark workers each with 18 cores and 18
  GB of RAM.
  In our application we use JdbcRDD to load data from a DB and then cache
  it.
  We load entities from a single table; we now have 76 million entities
  (entity size in memory is about 160 bytes). We call count() during
  application startup to force entity loading. Here are our measurements
  for the
  count() operation (cores x partitions = time):
  36x36 = 6.5 min
  36x72 = 7.7 min
  36x108 = 9.4 min
 
  So despite the recommendation, the most efficient setup is one partition
  per core. What is the reason for the above recommendation?
 
  Java 8, Apache Spark 1.1.0
 
 
 
 
 
 







Re: Tuning number of partitions per CPU

2015-02-13 Thread Sean Owen
18 cores or 36? It probably doesn't matter.
For this case, where you have some per-partition overhead from setting up
the DB connection, it may indeed not help to chop the data up more finely
than your total parallelism, although that would imply quite a large
overhead. Are you doing any other expensive initialization per partition
in your code?
You might check some other basic things, like whether you are bottlenecked
on the DB (probably not) and whether there are task stragglers drawing out
the completion time.
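
For reference, the place where the partition count and the per-partition
connection come together in this kind of job looks roughly like the sketch
below (Scala, Spark 1.x JdbcRDD; the JDBC URL, credentials, table and column
names are made up):

import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Rough sketch of the load described below. Each of the numPartitions
// partitions opens its own connection and fetches one range of ids, so more
// partitions also means more connection setups.
def loadEntities(sc: SparkContext, numPartitions: Int) = {
  val entities = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:postgresql://dbhost/appdb", "user", "secret"),
    "SELECT id, payload FROM entity WHERE id >= ? AND id <= ?",
    lowerBound = 1L,
    upperBound = 76000000L,          // roughly the 76M rows mentioned below
    numPartitions = numPartitions,   // the knob being discussed: 36, 72, 108, ...
    mapRow = rs => (rs.getLong(1), rs.getString(2))
  )
  entities.cache()
  entities.count()                   // force the load, as in the measurements below
  entities
}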

On Fri, Feb 13, 2015 at 11:06 AM, Igor Petrov igorpetrov...@gmail.com wrote:
 Hello,

 In the Spark programming guide
 (http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
 recommendation: "Typically you want 2-4 partitions for each CPU in your
 cluster."

 We have a Spark Master and two Spark workers each with 18 cores and 18 GB of
 RAM.
 In our application we use JdbcRDD to load data from a DB and then cache it.
 We load entities from a single table; we now have 76 million entities
 (entity size in memory is about 160 bytes). We call count() during
 application startup to force entity loading. Here are our measurements for the
 count() operation (cores x partitions = time):
 36x36 = 6.5 min
 36x72 = 7.7 min
 36x108 = 9.4 min

 So despite the recommendation, the most efficient setup is one partition
 per core. What is the reason for the above recommendation?

 Java 8, Apache Spark 1.1.0
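
(As a rough sanity check on the numbers above: 76 million entities at about
160 bytes each is roughly 76,000,000 x 160 B, i.e. around 12 GB cached, which
fits in the 2 x 18 GB of worker memory. That works out to roughly 340 MB per
partition at 36 partitions, 170 MB at 72, and 110 MB at 108, so none of these
partition counts produces unreasonably small partitions.)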










RE: Tuning number of partitions per CPU

2015-02-13 Thread Puneet Kumar Ojha
Use the configuration below if you are using version 1.2:

SET spark.shuffle.consolidateFiles=true;
SET spark.rdd.compress=true;
SET spark.default.parallelism=1000;
SET spark.deploy.defaultCores=54;

Thanks
Puneet.
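
For completeness, the application-level settings above can also be set
programmatically when building the SparkContext. A rough sketch (Scala,
Spark 1.x; the app name is made up, the values are copied from the mail and
should be tuned per cluster; spark.deploy.defaultCores is a standalone-master
setting and is usually configured on the master rather than per application):

import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of the SET lines above (values copied verbatim).
val conf = new SparkConf()
  .setAppName("partition-tuning-example")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
  .set("spark.default.parallelism", "1000")
val sc = new SparkContext(conf)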

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Friday, February 13, 2015 4:46 PM
To: Igor Petrov
Cc: user@spark.apache.org
Subject: Re: Tuning number of partitions per CPU

18 cores or 36? It probably doesn't matter.
For this case, where you have some per-partition overhead from setting up the
DB connection, it may indeed not help to chop the data up more finely than
your total parallelism, although that would imply quite a large overhead. Are
you doing any other expensive initialization per partition in your code?
You might check some other basic things, like whether you are bottlenecked on
the DB (probably not) and whether there are task stragglers drawing out the
completion time.

On Fri, Feb 13, 2015 at 11:06 AM, Igor Petrov igorpetrov...@gmail.com wrote:
 Hello,

 In the Spark programming guide
 (http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
 recommendation: "Typically you want 2-4 partitions for each CPU in your
 cluster."

 We have a Spark Master and two Spark workers each with 18 cores and 18 
 GB of RAM.
 In our application we use JdbcRDD to load data from a DB and then cache it.
 We load entities from a single table; we now have 76 million entities
 (entity size in memory is about 160 bytes). We call count() during
 application startup to force entity loading. Here are our measurements
 for the
 count() operation (cores x partitions = time):
 36x36 = 6.5 min
 36x72 = 7.7 min
 36x108 = 9.4 min

 So despite the recommendation, the most efficient setup is one partition
 per core. What is the reason for the above recommendation?

 Java 8, Apache Spark 1.1.0






