Re: How number of partitions effect the performance?

Sean Owen Mon, 03 Nov 2014 03:10:12 -0800

Yes, partitions matter. Usually you can use the default, which makes
one partition per input split. That's usually a good choice, since it
lets one task process one block of data, all of which is on one machine.
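
As a rough sketch of that default: if you assume the split size equals
the HDFS block size, the partition count for a file is about the file
size divided by the block size, rounded up. The function name and the
128 MB block size below are illustrative assumptions, not values from
this thread, and the sketch ignores how Hadoop treats a small trailing
remainder.

```python
import math

# Hypothetical helper, for illustration only. Assumes one partition per
# input split and a split size equal to the HDFS block size.
def estimated_partitions(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Approximate the default partition count for a single input file."""
    # Round up: a partial final block still needs its own task.
    # Assume at least one partition even for a tiny file.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(estimated_partitions(one_gib))
```

So under these assumptions a file a little under 1 GB would land near
the 7 to 9 partition range being discussed here.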

Reasons I can imagine why 9 partitions would be faster than 7:

Probably: Your cluster can execute at least 9 tasks concurrently. With
9 partitions each task has less data to process, so the job finishes
sooner. This just means you weren't using your cluster's full
parallelism at 7.

9 partitions lets every task run local to its data, whereas 7 is too
few for the way the data blocks are distributed on HDFS. That is, maybe
7 is inducing a shuffle, for some reason in your code, whereas 9 is
not.

Your executors are running near their memory limit and are thrashing
in GC. With less data per task, you may avoid the thrashing and so run
a lot faster.

(Or there's some other factor that messed up your measurements :))

There are also cases where more partitions make a job slower.
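
To make the parallelism argument concrete, here is a minimal
back-of-the-envelope model, not a description of how Spark actually
schedules. It assumes tasks run in "waves" of at most `slots`
concurrent tasks, that work divides evenly across partitions, and that
each task pays a fixed overhead. The function name, the slot count and
the overhead value are all made-up assumptions for illustration.

```python
import math

# Hypothetical cost model, for illustration only. It ignores data
# locality, shuffles, GC and skew, all of which can dominate in practice.
def estimated_stage_time(total_work, partitions, slots, per_task_overhead=0.0):
    """Estimate stage time when tasks run in waves of `slots` tasks."""
    # Number of rounds needed to run every task.
    waves = math.ceil(partitions / slots)
    # Each task handles an equal share of the work plus a fixed cost.
    per_task = total_work / partitions + per_task_overhead
    return waves * per_task


if __name__ == "__main__":
    # Assumed: 9 concurrent task slots, 100 arbitrary units of work.
    for p in (7, 9, 1000):
        print(p, estimated_stage_time(100, p, slots=9, per_task_overhead=0.5))
```

In this idealized model, going from 7 to 9 partitions on 9 slots
shortens the work per task by a factor of 9/7, while 1000 partitions
come out slower because the per-task overhead and the extra waves add
up. It says nothing about the locality or GC effects above.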

On Mon, Nov 3, 2014 at 9:57 AM, shahab <shahab.mok...@gmail.com> wrote:
> Hi,
>
> I am wondering how the number of partitions affects performance in Spark.
>
> Is it just the parallelism (more partitions, more parallel sub-tasks) that
> improves the performance? Or are there other considerations?
>
> In my case, I ran a couple of map/reduce jobs on the same dataset twice, with
> two different partition counts, 7 and 9. I used a standalone cluster, with
> two workers on each, where the master resides on the same machine as one
> of the workers.
>
> Surprisingly, the performance of the map/reduce jobs with 9 partitions is
> almost 4X-5X better than with 7 partitions. Does this mean that
> choosing the right number of partitions is the key factor in Spark
> performance?
>
> best,
> /Shahab
