I believe groupByKey currently requires that all items for a specific key fit 
into a single executor's memory: 
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
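For example, a simple per-key aggregation with reduceByKey combines values 
map-side before the shuffle, so no executor has to hold a whole group in 
memory. Rough, untested sketch with made-up (String, Long) pairs (sc is the 
SparkContext, e.g. in spark-shell); it isn't your schema, just the pattern:

    // count rows per key instead of materializing each group
    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 1L), ("a", 1L)))
    val counts = pairs.reduceByKey(_ + _)   // partial sums happen map-side before the shuffle
    counts.collect()                        // Array((a,2), (b,2))

    // versus pairs.groupByKey().mapValues(_.sum), which ships every value
    // for a key to a single task before summing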

This previous discussion has some pointers if you must use groupByKey, 
including adding a low-cardinality hash to your key: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html
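The salting trick looks roughly like this (untested sketch; SALT_BUCKETS and 
keyedRdd are placeholders for your own bucket count and (key, value) RDD):

    val SALT_BUCKETS = 16
    val salted = keyedRdd.map { case (key, value) =>
      // spread one hot key across SALT_BUCKETS sub-keys
      ((key, scala.util.Random.nextInt(SALT_BUCKETS)), value)
    }
    val grouped = salted.groupByKey()   // each group is roughly 1/16th of the original size
    // process each sub-group, then merge the results per original key if you need to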

Another option I didn't see mentioned would be to persist / cache the initial 
RDD, calculate the set of distinct key values out of it, and then derive a set 
of filtered RDDs from the cached dataset, one for each key. For this to work, 
your set of unique keys would need to fit into your driver's memory.
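Roughly like this (untested; it reuses the names from your snippet below and 
assumes flatKeyFromWrf produces ((zone, z, year, month), row) pairs):

    val keyed = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf)
    keyed.cache()

    val keys = keyed.keys.distinct().collect()   // must fit in driver memory
    keys.foreach { k =>
      val oneKey = keyed.filter { case (key, _) => key == k }.values
      // write oneKey out for this (zone, z, year, month) partition,
      // e.g. convert it to a DataFrame on the driver and do the Hive insert
    }
    keyed.unpersist()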

Regards,
Will

On June 6, 2015, at 11:07 AM, patcharee <patcharee.thong...@uni.no> wrote:

Hi,

How can I write to multiple outputs, one for each key? I tried creating a 
custom partitioner and setting the number of partitions, but neither works. 
Only a few tasks/partitions (as many as there are key combinations) receive 
large datasets; the data is not spread across all tasks/partitions. The job 
failed because those few tasks had to handle datasets that were far too 
large. Below is my code snippet.

val varWFlatRDD = 
  varWRDD.map(FlatMapUtilClass().flatKeyFromWrf).groupByKey() // keys are (zone, z, year, month)
    .foreach(
      x => {
        val z = x._1._1
        val year = x._1._2
        val month = x._1._3
        val df_table_4dim = x._2.toList.toDF()
        df_table_4dim.registerTempTable("table_4Dim")
        hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
          "(zone=" + ZONE + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
          "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
          "qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
      })


 From the Spark history UI, the groupByKey stage has > 1000 tasks (equal to 
the parent's number of partitions). The foreach stage also has > 1000 tasks, 
but only 50 of them (the same as the number of key combinations) actually 
receive data.

How can I fix this problem? Any suggestions are appreciated.

BR,
Patcharee



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

