How number of partitions effect the performance?

2014-11-03 Thread shahab
Hi,

I just wonder how number of partitions effect the performance in Spark!

Is it just the parallelism (more partitions, more parallel sub-tasks) that
improves the performance? or there exist other considerations?

In my case,I run couple of map/reduce jobs on same dataset two times with
two different partition numbers, 7 and 9. I used a stand alone cluster,
with two workers on each, where the master resides with the same machine as
one of the workers.

Surprisingly, the performance of map/reduce jobs in case of 9 partitions is
almost  4X-5X better than that of 7 partitions !??  Does it mean that
choosing right number of partitions is the key factor in the Spark
performance ?

best,
/Shahab


Re: How number of partitions effect the performance?

2014-11-03 Thread Sean Owen
Yes partitions matter. Usually you can use the default, which will
make a partition per input split, and that's usually good, to let one
task process one block of data, which will all be on one machine.

Reasons I could imagine why 9 partitions is faster than 7:

Probably: Your cluster can execute at least 9 tasks concurrently. It
will finish faster since each partition is smaller when split into 9
partitions. This just means you weren't using your cluster's full
parallelism at 7.

9 partitions lets tasks execute entirely locally to the data, whereas
7 is too few compared to how the data blocks are distributed on HDFS.
That is, maybe 7 is inducing a shuffle whereas 9 is not for some
reason in your code.

Your executors are running near their memory limit and are thrashing
in GC. With less data to process each, you may avoid thrashing and so
go a lot faster.

(Or there's some other factor that messed up your measurements :))


There can be instances where more partitions is slower too.

On Mon, Nov 3, 2014 at 9:57 AM, shahab shahab.mok...@gmail.com wrote:
 Hi,

 I just wonder how number of partitions effect the performance in Spark!

 Is it just the parallelism (more partitions, more parallel sub-tasks) that
 improves the performance? or there exist other considerations?

 In my case,I run couple of map/reduce jobs on same dataset two times with
 two different partition numbers, 7 and 9. I used a stand alone cluster, with
 two workers on each, where the master resides with the same machine as one
 of the workers.

 Surprisingly, the performance of map/reduce jobs in case of 9 partitions is
 almost  4X-5X better than that of 7 partitions !??  Does it mean that
 choosing right number of partitions is the key factor in the Spark
 performance ?

 best,
 /Shahab

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How number of partitions effect the performance?

2014-11-03 Thread shahab
Thanks Sean for very useful comments. I understand now better what could be
the reasons that my evaluations are messed up.

best,
/Shahab

On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote:

 Yes partitions matter. Usually you can use the default, which will
 make a partition per input split, and that's usually good, to let one
 task process one block of data, which will all be on one machine.

 Reasons I could imagine why 9 partitions is faster than 7:

 Probably: Your cluster can execute at least 9 tasks concurrently. It
 will finish faster since each partition is smaller when split into 9
 partitions. This just means you weren't using your cluster's full
 parallelism at 7.

 9 partitions lets tasks execute entirely locally to the data, whereas
 7 is too few compared to how the data blocks are distributed on HDFS.
 That is, maybe 7 is inducing a shuffle whereas 9 is not for some
 reason in your code.

 Your executors are running near their memory limit and are thrashing
 in GC. With less data to process each, you may avoid thrashing and so
 go a lot faster.

 (Or there's some other factor that messed up your measurements :))


 There can be instances where more partitions is slower too.

 On Mon, Nov 3, 2014 at 9:57 AM, shahab shahab.mok...@gmail.com wrote:
  Hi,
 
  I just wonder how number of partitions effect the performance in Spark!
 
  Is it just the parallelism (more partitions, more parallel sub-tasks)
 that
  improves the performance? or there exist other considerations?
 
  In my case,I run couple of map/reduce jobs on same dataset two times with
  two different partition numbers, 7 and 9. I used a stand alone cluster,
 with
  two workers on each, where the master resides with the same machine as
 one
  of the workers.
 
  Surprisingly, the performance of map/reduce jobs in case of 9 partitions
 is
  almost  4X-5X better than that of 7 partitions !??  Does it mean that
  choosing right number of partitions is the key factor in the Spark
  performance ?
 
  best,
  /Shahab