I believe groupByKey currently requires that all items for a specific key fit into a single executor's memory: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
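As a quick illustration (toy data, nothing from your job): reduceByKey combines values per partition before the shuffle, while groupByKey has to buffer every value for a key on one executor first:

// Toy pair RDD; assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey: all values for "a" are pulled into a single in-memory
// buffer on one executor before they can be combined.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed map-side per partition, so
// no executor ever holds all of a key's values at once.
val sumsViaReduce = pairs.reduceByKey(_ + _)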
This previous discussion has some pointers if you must use groupByKey, including adding a low-cardinality hash ("salt") to your key: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html

Another option I didn't see mentioned would be to persist / cache the initial RDD, calculate the set of distinct key values from it, and then derive a set of filtered RDDs from the cached dataset, one per key. For this to work, your set of unique keys would need to fit into your driver's memory. (Rough sketches of both ideas follow at the end of this message.)

Regards,
Will

On June 6, 2015, at 11:07 AM, patcharee <patcharee.thong...@uni.no> wrote:

Hi,

How can I write to multiple outputs for each key? I tried creating a custom partitioner and defining the number of partitions, but neither works. Only a few tasks/partitions (as many as there are key combinations) get large datasets; the data is not spread across all tasks/partitions. The job fails because those few tasks handle far too much data. Below is my code snippet.

val varWFlatRDD = varWRDD
  .map(FlatMapUtilClass().flatKeyFromWrf)
  .groupByKey() // key: (zone, z, year, month)
  .foreach { x =>
    val z = x._1._1
    val year = x._1._2
    val month = x._1._3
    val df_table_4dim = x._2.toList.toDF()
    df_table_4dim.registerTempTable("table_4Dim")
    hiveContext.sql(
      "INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + z +
      ",year=" + year + ",month=" + month + ") " +
      "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, " +
      "qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
  }

From the Spark history UI: at groupByKey there are > 1000 tasks (equal to the parent's number of partitions). At foreach there are > 1000 tasks as well, but only 50 of them (the number of key combinations) receive data.

How can I fix this problem? Any suggestions are appreciated.

BR,
Patcharee
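To make the salting idea concrete, here is a rough, untested sketch against the keyed RDD from the snippet; numSalts is a made-up tuning knob:

import scala.util.Random

val numSalts = 16 // made up; more salts means smaller groups per hot key

// (key, row) pairs as in the snippet, before groupByKey.
val keyed = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf)

// Widen each key with a small random salt so one hot key's rows are
// spread over up to numSalts groups instead of one giant group.
val salted = keyed.map { case (key, row) => ((key, Random.nextInt(numSalts)), row) }

// Each original key now arrives as several smaller groups. Since the
// pieces of one Hive partition show up separately, they would need to
// be appended (INSERT INTO) rather than overwritten (INSERT OVERWRITE).
val grouped = salted.groupByKey()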
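And a sketch of the cache-and-filter option (also untested; reuses names from the snippet, and is reasonable here because there are only ~50 key combinations):

// (key, row) pairs as in the snippet; cache so the parent RDD is not
// recomputed once per key.
val keyed = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf)
keyed.persist()

// The distinct keys are collected to the driver, so they must fit in
// driver memory (~50 combinations in this job).
val allKeys = keyed.keys.distinct().collect()

for (key <- allKeys) {
  val rowsForKey = keyed.filter { case (k, _) => k == key }.values
  // Write rowsForKey out for this key, e.g. toDF() plus the INSERT
  // statement from the snippet; note this loop runs on the driver,
  // where hiveContext is actually usable.
}

keyed.unpersist()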