Hi Shreya,
The initial Datasets had more than 1000 partitions, but after a group-by
operation the resulting Dataset had only 200 partitions (because
spark.sql.shuffle.partitions, the number of partitions used after a
shuffle, defaults to 200). Any further operations on the resulting Dataset
will therefore run with a parallelism of at most 200.
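
To make the above concrete, here is a minimal Scala sketch, not the actual
job. The input data, the column name "key" and the object name are made up
for illustration. Adaptive query execution is switched off only so that, on
newer Spark versions, post-shuffle partitions are not coalesced and the
configured count stays visible.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name; the data below is synthetic.
object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-sketch")
      .master("local[*]")
      // Assumption: disable AQE so the shuffle partition count is not coalesced.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    // Synthetic input: one million rows spread over 1000 partitions.
    val input = spark.range(0L, 1000000L, 1L, 1000)
      .withColumn("key", col("id") % 5000)
    println(s"input partitions: ${input.rdd.getNumPartitions}")

    // groupBy triggers a shuffle; the number of output partitions is taken
    // from spark.sql.shuffle.partitions, not from the input.
    val grouped = input.groupBy("key").count()
    println(s"grouped partitions (default setting): ${grouped.rdd.getNumPartitions}")

    // Option 1: raise the shuffle partition count before running the aggregation.
    spark.conf.set("spark.sql.shuffle.partitions", "1000")
    val groupedWide = input.groupBy("key").count()
    println(s"grouped partitions (after conf change): ${groupedWide.rdd.getNumPartitions}")

    // Option 2: repartition the already-grouped result (costs one more shuffle).
    val repartitioned = grouped.repartition(1000)
    println(s"partitions after repartition: ${repartitioned.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Changing the setting avoids the extra shuffle that an explicit repartition
adds, so it is usually the cheaper option when the goal is only more
parallelism after the aggregation.
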
Curious: why do you want to repartition? Is there a subsequent step that
fails because there are too few partitions, or do you want to do it for a
performance gain?
Also, how many partitions did your initial Datasets have, and how many did
the result of the join have?
From: Aniket Bhatnagar