Hi,

In our use case of groupByKey(...): RDD[(K, Iterable[V])], there
might be a case (an extreme one, admittedly) where even for a single key,
the associated Iterable[V] could result in an OOM.

Would it be possible to provide such a 'groupByKeyWithRDD', i.e. one
returning RDD[(K, RDD[V])]?

And, ideally, it would be great if the internal implementation of the RDD[V]
were smart enough to spill data to disk only beyond a configured threshold.
That way, we wouldn't sacrifice performance in the normal cases.
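To make the idea concrete, here is a minimal, Spark-free sketch of the
spill-above-threshold behavior we have in mind. All names here
(SpillableBuffer, threshold, serialize/deserialize) are hypothetical
illustrations, not an actual Spark API:

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Hypothetical sketch: keep values in memory up to a configured threshold,
// then spill any further values to a temporary file on disk.
class SpillableBuffer[V](threshold: Int)(serialize: V => String,
                                         deserialize: String => V) {
  private val inMemory = ArrayBuffer.empty[V]
  private var spillFile: Option[File] = None
  private var writer: Option[PrintWriter] = None

  def append(v: V): Unit = {
    if (inMemory.size < threshold) {
      inMemory += v                      // normal case: stay in memory
    } else {
      val w = writer.getOrElse {         // first spill: open a temp file
        val f = File.createTempFile("spill", ".txt")
        f.deleteOnExit()
        spillFile = Some(f)
        val pw = new PrintWriter(f)
        writer = Some(pw)
        pw
      }
      w.println(serialize(v))            // overflow goes to disk
    }
  }

  // Iterate memory-resident values first, then any spilled ones.
  def iterator: Iterator[V] = {
    writer.foreach(_.flush())
    val spilled = spillFile
      .map(f => Source.fromFile(f).getLines().map(deserialize))
      .getOrElse(Iterator.empty)
    inMemory.iterator ++ spilled
  }
}

val buf = new SpillableBuffer[Int](threshold = 3)(_.toString, _.toInt)
(1 to 5).foreach(buf.append)
println(buf.iterator.toList)
```

Small values never touch the disk, so the common case pays no I/O cost;
only the oversized keys trade some speed for not blowing up the heap.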

Any suggestions/comments are welcome. Thanks a lot!

Just a side note: we do understand the points made here:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html;
however, 'reduceByKey' and 'foldByKey' don't quite fit our needs right now,
that is to say, we can't really avoid 'groupByKey'.

-- 
ChuChao
