Hi,

I need suggestions on my code. I would like to split a DataFrame (rowDF) by a column (depth) into groups, then sort each group, repartition it, and save the output of each group into one file. See the code below:

import org.apache.spark.sql.SaveMode

val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache()
for (i <- 0 to 16) {
  // Select one depth group, sort it, and squeeze it into a single partition
  val filterDF = rowDF.filter("depth=" + i)
  val finalDF = filterDF.sort("xy").coalesce(1)
  // Append the group as one file to the ORC table, partitioned by depth
  finalDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
    .mode(SaveMode.Append).partitionBy("depth").saveAsTable(args(3))
}

The problem is that the filtered groups are processed one at a time, one job after another. How can I change the code so that the groups are written in parallel?
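One idea I had (just an untested sketch; rowDF, args, and the table setup are as above, and I am not sure whether concurrent appends to the same Hive table are safe) is to submit the per-depth jobs from the driver concurrently with Scala futures:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SaveMode

// Submit all 17 group jobs at once; Spark's scheduler can then run them
// concurrently across the executors instead of one after another.
val jobs = (0 to 16).map { i =>
  Future {
    rowDF.filter("depth=" + i)
      .sort("xy")
      .coalesce(1)
      .write.format("org.apache.spark.sql.hive.orc.DefaultSource")
      .mode(SaveMode.Append)   // concurrent appends may race on the metastore?
      .partitionBy("depth")
      .saveAsTable(args(3))
  }
}
// Wait for every group to finish before moving on.
jobs.foreach(Await.result(_, Duration.Inf))

The futures only change when the jobs are submitted; how they share the executors is still up to the scheduler. Is this a reasonable approach, or is there a better way?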

I looked at groupBy, but it seems to be only for aggregation.
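For example, as far as I can see it only lets me compute aggregates per group, something like:

import org.apache.spark.sql.functions.max

// Per-group aggregation works, but I get one aggregated row per group,
// not the group itself back as a DataFrame to sort and write out.
rowDF.groupBy("depth").agg(max("xy"))

but I did not find a way to get each group back as its own DataFrame.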

Thanks,
Patcharee

