Each RDD is composed of multiple blocks known as partitions. When you apply a
transformation, the resulting RDD's partition count depends on the operation:
narrow transformations like map preserve it, but operations such as union or
shuffle-based ones (join, groupByKey, etc.) can produce more partitions
(union, for example, concatenates the partition lists of its inputs). That is
probably why you are seeing an increased number of partitions.
I don't think an increased number of partitions will by itself cause a
performance decrease, as more partitions help distribute tasks evenly across
machines and cores (though a very large number of tiny partitions does add
task-scheduling overhead). If you don't want more partitions, you can call
.repartition (or .coalesce, which avoids a full shuffle when reducing the
count) on the transformed RDD.
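As a minimal sketch of the above (the RDD contents, partition counts, and the local master setting are illustrative, not from your application), you can inspect how an operation changes the partition count and then shrink it again:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-count-sketch").setMaster("local[*]"))

    val base = sc.parallelize(1 to 1000, 40)   // start with 40 partitions
    println(base.partitions.length)            // 40

    // union simply concatenates the partition lists of its inputs: 40 + 40 = 80
    val grown = base.union(sc.parallelize(1 to 1000, 40))
    println(grown.partitions.length)           // 80

    // coalesce shrinks the count without a full shuffle;
    // repartition(40) would do the same with a shuffle (better balance)
    val back = grown.coalesce(40)
    println(back.partitions.length)            // 40

    sc.stop()
  }
}
```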
A custom partitioner can improve performance depending on the use case you
have, e.g. by co-partitioning RDDs that are repeatedly joined on the same key.
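For instance (a sketch with made-up key/value data; nothing here is specific to your job), partitioning both sides of a join with the same HashPartitioner lets the join reuse the existing partitioning instead of reshuffling both RDDs:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("co-partition-sketch").setMaster("local[*]"))

    // Hypothetical key/value data
    val events   = sc.parallelize(Seq(("user1", 10), ("user2", 20), ("user1", 30)))
    val profiles = sc.parallelize(Seq(("user1", "A"), ("user2", "B")))

    // Partition both RDDs identically and cache the side that is reused;
    // a join between co-partitioned RDDs avoids an extra shuffle.
    val part = new HashPartitioner(40)
    val eventsByUser   = events.partitionBy(part).cache()
    val profilesByUser = profiles.partitionBy(part)

    val joined = eventsByUser.join(profilesByUser)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```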
Thanks
Best Regards
On Fri, Mar 27, 2015 at 4:39 PM, sayantini sayantiniba...@gmail.com wrote:
In our application we load our historical data into 40 partitioned RDDs (no.
of available cores x 2), and we have not implemented any custom partitioner.
After applying transformations on these RDDs, intermediate RDDs are created
with more than 40 partitions, sometimes going up to 300.
1. Is Spark intelligent enough to manage the partitions of an RDD? Please
suggest why there is an increase in the no. of partitions.
2. We suspect that the increase in the no. of partitions is causing a
decrease in performance.
3. If we create a custom Partitioner will it improve our performance?
Thanks,
Sayantini