Hi,
I would like some suggestions on my code. I want to split a DataFrame (rowDF)
by a column (depth) into groups, then sort each group, repartition it, and
save the output of each group into a single file. See the code below:
val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache()
for (i <- 0 to 16) {
  val filterDF = rowDF.filter("depth = " + i)
  val finalDF = filterDF.sort("xy").coalesce(1)
  finalDF.write
    .format("org.apache.spark.sql.hive.orc.DefaultSource")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .partitionBy("depth")
    .saveAsTable(args(3))
}
The problem is that each filtered group is handled by the executors
sequentially, one job after another. How can I change the code so that the
groups are processed in parallel? I looked at groupBy, but it seems to be
only for aggregation.
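For context, the kind of change I am imagining (only a sketch, not verified against Spark) is to submit each group's job from its own thread with Scala Futures, since Spark can schedule actions submitted from separate threads concurrently. Here `writeGroup` is a hypothetical placeholder for the per-depth filter/sort/coalesce/write above:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelGroups {
  // Hypothetical placeholder for the real per-group Spark work:
  // rowDF.filter("depth = " + depth).sort("xy").coalesce(1).write...
  def writeGroup(depth: Int): String = s"wrote group $depth"

  def main(args: Array[String]): Unit = {
    // Kick off all 17 groups concurrently instead of looping sequentially;
    // each Future runs on the thread pool and submits its own job.
    val jobs = (0 to 16).map(d => Future(writeGroup(d)))
    // Block until every group has been written.
    val results = Await.result(Future.sequence(jobs), Duration.Inf)
    results.foreach(println)
  }
}
```

I am not sure whether this is the idiomatic way to do it, or whether Spark offers something built in for this pattern.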
Thanks,
Patcharee