Yes partitions matter. Usually you can use the default, which will make a partition per input split, and that's usually good, to let one task process one block of data, which will all be on one machine.
Reasons I could imagine why 9 partitions is faster than 7: Probably: Your cluster can execute at least 9 tasks concurrently. It will finish faster since each partition is smaller when split into 9 partitions. This just means you weren't using your cluster's full parallelism at 7. 9 partitions lets tasks execute entirely locally to the data, whereas 7 is too few compared to how the data blocks are distributed on HDFS. That is, maybe 7 is inducing a shuffle whereas 9 is not for some reason in your code. Your executors are running near their memory limit and are thrashing in GC. With less data to process each, you may avoid thrashing and so go a lot faster. (Or there's some other factor that messed up your measurements :)) There can be instances where more partitions is slower too. On Mon, Nov 3, 2014 at 9:57 AM, shahab <shahab.mok...@gmail.com> wrote: > Hi, > > I just wonder how number of partitions effect the performance in Spark! > > Is it just the parallelism (more partitions, more parallel sub-tasks) that > improves the performance? or there exist other considerations? > > In my case,I run couple of map/reduce jobs on same dataset two times with > two different partition numbers, 7 and 9. I used a stand alone cluster, with > two workers on each, where the master resides with the same machine as one > of the workers. > > Surprisingly, the performance of map/reduce jobs in case of 9 partitions is > almost 4X-5X better than that of 7 partitions !?? Does it mean that > choosing right number of partitions is the key factor in the Spark > performance ? > > best, > /Shahab --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org