Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread sayantini
In our application where we load our historical data in 40 partitioned RDDs
(no. of available cores X 2) and we have not implemented any custom
partitioner.

After applying transformations on these RDDs intermediate RDDs are created
which have partitions greater than 40 and sometimes partitions are going up
till 300.

1. Is Spark intelligent enough to manage the partitions of RDD? Please
suggest why there is an increase in the no. of partitions?

2. We suspect that increasing the no. of partitions is causing decrease in
performance.

3. If we create a custom Partitioner will it improve our performance?



Thanks,

Sayantini


Re: Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread Akhil Das
Each RDD is composed of multiple blocks known as partitions, when you apply
transformation over it, then it can grow in size depending on the operation
(as the # objects/references increase) and that is probably the reason why
you are seeing increased number of partitions.

I don't think increased number of partitions will cause any performance
decrease as it helps to evenly distribute the tasks across machines and per
core. If you don't want more partitions, then you can do a .repartition
over the transformed RDD.

Custom partitioner can improve the performance depending on the usecase
that you are having.

Thanks
Best Regards

On Fri, Mar 27, 2015 at 4:39 PM, sayantini sayantiniba...@gmail.com wrote:

 In our application where we load our historical data in 40 partitioned
 RDDs (no. of available cores X 2) and we have not implemented any custom
 partitioner.

 After applying transformations on these RDDs intermediate RDDs are created
 which have partitions greater than 40 and sometimes partitions are going up
 till 300.

 1. Is Spark intelligent enough to manage the partitions of RDD? Please
 suggest why there is an increase in the no. of partitions?

 2. We suspect that increasing the no. of partitions is causing decrease in
 performance.

 3. If we create a custom Partitioner will it improve our performance?



 Thanks,

 Sayantini