Hi All,

I'm trying to create a Dataset from an RDD and run groupBy on that Dataset. The
groupBy stage runs with only 200 partitions, even though the underlying RDD has
5000 partitions, and I can't seem to find a way to change those 200 partitions
on the Dataset to some larger number. This is hurting parallelism, since there
are 700 executors but only 200 partitions in that stage.
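My understanding is that the 200 comes from the spark.sql.shuffle.partitions
setting (200 is its default); a minimal sketch of how I'd expect to raise it,
in case I'm just setting it wrong (the app name below is arbitrary):

import org.apache.spark.sql.SparkSession

// Assumption: spark.sql.shuffle.partitions controls the partition count for
// shuffles triggered by Dataset operations such as groupBy (default is 200).
val sparkSession = SparkSession.builder()
  .appName("sqs-aggregation") // arbitrary name, only for this sketch
  .config("spark.sql.shuffle.partitions", "5000")
  .getOrCreate()

// Or on an existing session, before the groupBy runs:
sparkSession.conf.set("spark.sql.shuffle.partitions", "5000")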

The code looks somewhat like:

val sqsDstream = sparkStreamingContext.union(
      (1 to 3).map(_ => sparkStreamingContext.receiverStream(new SQSReceiver()))
    ).transform(_.repartition(5000))

sqsDstream.foreachRDD(rdd => {
  val dataSet = sparkSession.createDataset(rdd)
  val aggregatedDataset: Dataset[Row] =
    dataSet.groupBy("primaryKey").agg(udaf("key1"))
  aggregatedDataset.foreachPartition(partition => {
    // write to output stream
  })
})
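One workaround I can think of is to explicitly repartition the aggregated
Dataset before the write, though that adds yet another shuffle on top of the
groupBy; a rough sketch (5000 is just the same number used above):

import org.apache.spark.sql.functions.col

// Repartition the aggregation result by the same key before writing out;
// repartition(numPartitions, exprs*) is the standard Dataset API.
val repartitioned = aggregatedDataset.repartition(5000, col("primaryKey"))
repartitioned.foreachPartition(partition => {
  // write to output stream
})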


Any pointers would be appreciated.
Thanks,
Bharath
