Each RDD is composed of multiple blocks known as partitions. When you apply
a transformation over it, the number of partitions can change depending on
the operation (a union, for example, produces an RDD carrying the partitions
of both inputs), and that is probably why you are seeing an increased number
of partitions.
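
For example, in spark-shell (where a SparkContext sc already exists; the
data and partition counts here are made up for illustration):

    // union concatenates the partitions of its inputs, so the count grows
    val a = sc.parallelize(1 to 1000, numSlices = 40)
    val b = sc.parallelize(1 to 1000, numSlices = 40)
    val combined = a.union(b)
    println(combined.partitions.length)   // prints 80 (40 + 40)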

I don't think an increased number of partitions will by itself cause a
performance decrease, as it helps distribute the tasks evenly across
machines and cores. If you don't want more partitions, you can call
.repartition on the transformed RDD.
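
Continuing the sketch above, bringing the count back down to 40 could look
like this (coalesce is an alternative that avoids a full shuffle when
shrinking):

    // repartition does a full shuffle into exactly 40 partitions
    val resized = combined.repartition(40)
    println(resized.partitions.length)    // prints 40

    // coalesce(40) merges existing partitions without a full shuffle
    val merged = combined.coalesce(40)
    println(merged.partitions.length)     // prints 40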

A custom partitioner can improve performance depending on your use case,
for example when you repeatedly join or group the same keyed data and want
to avoid reshuffling it each time.
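
For reference, a bare-bones custom Partitioner might look like the sketch
below; the hashing scheme and the pair RDD are illustrative assumptions,
not something from your job:

    import org.apache.spark.Partitioner

    // Illustrative partitioner: buckets keys by hash into parts partitions
    class MyPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int =
        ((key.hashCode % parts) + parts) % parts  // non-negative modulo
      // Spark compares partitioners to decide whether a shuffle is needed
      override def equals(other: Any): Boolean = other match {
        case p: MyPartitioner => p.numPartitions == numPartitions
        case _                => false
      }
      override def hashCode: Int = numPartitions
    }

    // Pre-partitioning a pair RDD by key lets later joins/groupBys on the
    // same keys reuse the layout instead of reshuffling every time
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val partitioned = pairs.partitionBy(new MyPartitioner(40))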

Thanks
Best Regards

On Fri, Mar 27, 2015 at 4:39 PM, sayantini <sayantiniba...@gmail.com> wrote:

> In our application we load our historical data into 40 partitioned
> RDDs (no. of available cores x 2), and we have not implemented any custom
> partitioner.
>
> After applying transformations on these RDDs, intermediate RDDs are
> created with more than 40 partitions, sometimes going up to 300.
>
> 1. Is Spark intelligent enough to manage the partitions of an RDD? Can
> you suggest why there is an increase in the no. of partitions?
>
> 2. We suspect that the increase in the no. of partitions is causing a
> decrease in performance.
>
> 3. If we create a custom Partitioner, will it improve our performance?
>
>
>
> Thanks,
>
> Sayantini
>
