Hi,
How can I write to multiple outputs, one for each key? I tried creating a
custom partitioner and setting the number of partitions, but neither helps.
Only a few tasks/partitions (as many as the number of distinct key
combinations) receive large datasets; the data is not spread across all
tasks/partitions. The job failed because those few tasks had to handle far
too much data. Below is my code snippet.
val varWFlatRDD = varWRDD
  .map(FlatMapUtilClass().flatKeyFromWrf)   // keys are (zone, z, year, month)
  .groupByKey()
  .foreach { x =>
    val z = x._1._1
    val year = x._1._2
    val month = x._1._3
    // Collect this key's records into a DataFrame and register it
    val df_table_4dim = x._2.toList.toDF()
    df_table_4dim.registerTempTable("table_4Dim")
    // Write the whole group into the matching Hive partition
    hiveContext.sql("INSERT OVERWRITE TABLE 4dim PARTITION " +
      "(zone=" + ZONE + ", z=" + z + ", year=" + year + ", month=" + month + ") " +
      "SELECT date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
      "qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl FROM table_4Dim")
  }
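For reference, the custom partitioner I tried looks roughly like this (a
simplified sketch; the class name and the partition count are illustrative):

import org.apache.spark.Partitioner

class KeyTuplePartitioner(override val numPartitions: Int) extends Partitioner {
  // Spread the composite keys over numPartitions by hashing the whole tuple
  def getPartition(key: Any): Int = {
    val h = key.hashCode() % numPartitions
    if (h < 0) h + numPartitions else h   // keep the index non-negative
  }
}

// used as: ....groupByKey(new KeyTuplePartitioner(1000))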
In the Spark history UI, the groupByKey stage shows > 1000 tasks (equal to
the parent's number of partitions). The foreach stage also shows > 1000
tasks, but only 50 of them (the same as the number of distinct key
combinations) actually receive data.
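One way to see the skew directly is to count records per key, e.g.
(illustrative check only):

val keyCounts = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf).keys.countByValue()
keyCounts.toSeq.sortBy(-_._2).take(10).foreach(println)   // largest keys first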
How can I fix this problem? Any suggestions are appreciated.
BR,
Patcharee