Hi,
I would like some suggestions on my code. I want to split a DataFrame (rowDF)
by a column (depth) into groups, then sort each group, repartition it, and
save the output of each group into a single file. See the code below:
val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache()
for (i <- 0 to 16) {
  val filterDF = rowDF.filter("depth = " + i)
  val finalDF = filterDF.sort("xy").coalesce(1)
  finalDF.write
    .format("org.apache.spark.sql.hive.orc.DefaultSource")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .partitionBy("depth")
    .saveAsTable(args(3))
}
The problem is that each filtered group is handled by the executors
sequentially, one job after another. How can I change the code so that the
groups are processed in parallel? I looked at groupBy, but it seems to be
only for aggregation.
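For context, the kind of change I am imagining (only a sketch, not verified against Spark) is to submit each group's job from its own thread with Scala Futures, since Spark can schedule actions submitted from separate threads concurrently. Here `writeGroup` is a hypothetical placeholder for the per-depth filter/sort/coalesce/write above:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelGroups {
  // Hypothetical placeholder for the real per-group Spark work:
  // rowDF.filter("depth = " + depth).sort("xy").coalesce(1).write...
  def writeGroup(depth: Int): String = s"wrote group $depth"

  def main(args: Array[String]): Unit = {
    // Kick off all 17 groups concurrently instead of looping sequentially;
    // each Future runs on the thread pool and submits its own job.
    val jobs = (0 to 16).map(d => Future(writeGroup(d)))
    // Block until every group has been written.
    val results = Await.result(Future.sequence(jobs), Duration.Inf)
    results.foreach(println)
  }
}
```

I am not sure whether this is the idiomatic way to do it, or whether Spark offers something built in for this pattern.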
Thanks,
Patcharee