Hi

I can't seem to find a way to pass the number of partitions when joining two
Datasets or doing a groupBy operation on a Dataset. There is an option of
repartitioning the resultant Dataset, but it's inefficient to repartition
after the Dataset has already been joined/grouped into the default number of
partitions. With the RDD API this was easy to do, as the functions accepted a
numPartitions parameter. The only workaround seems to be
sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but
this means that all join/groupBy operations going forward will use the same
number of partitions.
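
To illustrate the difference, here's a rough sketch (the names spark, ds1
and ds2 are made-up examples, not from a real job):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partitions-example").getOrCreate()
  import spark.implicits._

  val ds1 = Seq((1, "a"), (2, "b")).toDS()
  val ds2 = Seq((1, "x"), (2, "y")).toDS()

  // RDD API: the shuffle operation itself takes numPartitions.
  val joinedRdd = ds1.rdd.keyBy(_._1).join(ds2.rdd.keyBy(_._1), 100)

  // Dataset API: no such parameter; the join shuffles into
  // spark.sql.shuffle.partitions (200 by default) ...
  val joined = ds1.joinWith(ds2, ds1("_1") === ds2("_1"))

  // ... so the only options seem to be an extra repartition afterwards
  val repartitioned = joined.repartition(100)

  // or changing the global setting, which affects every later shuffle.
  spark.conf.set("spark.sql.shuffle.partitions", "100")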

Thanks,
Aniket
