Hi, I am wondering how the number of partitions affects performance in Spark.
Is it just the parallelism (more partitions, more parallel sub-tasks) that improves performance, or are there other considerations? In my case, I ran a couple of map/reduce jobs on the same dataset twice, with two different partition counts: 7 and 9. In both runs I used a standalone cluster with two workers, where the master resides on the same machine as one of the workers. Surprisingly, the map/reduce jobs ran almost 4x-5x faster with 9 partitions than with 7. Does this mean that choosing the right number of partitions is the key factor in Spark performance?

Best,
/Shahab
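
P.S. As a rough sanity check on the parallelism-only explanation, here is a minimal sketch in plain Python (no Spark needed). It assumes a simplified model in which every task takes about the same time and each core runs one task at a time, so a stage finishes in ceil(partitions / cores) sequential "waves". The `waves` helper and the `total_cores` values are hypothetical, chosen only for illustration; they are not the actual core counts of my cluster.

```python
import math


def waves(num_partitions: int, total_cores: int) -> int:
    """Sequential task waves needed when each core runs one task at a time.

    Simplified model: equal-sized tasks, no scheduling or shuffle overhead.
    """
    if num_partitions < 0 or total_cores < 1:
        raise ValueError("need num_partitions >= 0 and total_cores >= 1")
    return math.ceil(num_partitions / total_cores)


# total_cores values are hypothetical, not measured from the cluster above.
for total_cores in (2, 4, 8):
    print(
        f"cores={total_cores}: "
        f"7 partitions -> {waves(7, total_cores)} wave(s), "
        f"9 partitions -> {waves(9, total_cores)} wave(s)"
    )
```

Under this model, 9 partitions never need fewer waves than 7 for any core count, so the wave count alone would not seem to account for a 4x-5x speedup.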