To expand on what Sean said, I would look into replacing groupByKey with reduceByKey. Also take a look at this doc: <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html>. I happen to have designed a library that drew the same criticism when compared to the Java MapReduce API with respect to its use of iterables. However, neither we nor the critics could ever find a natural example of a computation that can be expressed as a single pass through each group using a constant amount of memory, yet cannot be converted to use a combiner (MapReduce jargon; called a reduce in Spark and most functional circles). If you have found such an example, then, while it is an obstacle for you, it would be of real interest to know what it is.
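To illustrate the conversion, here is a minimal sketch in Scala (the word-count-style sum, the input path, and the variable names are hypothetical, just to show the shape of the change):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("example"))

    // Hypothetical input: (word, 1) pairs, possibly millions of values per key.
    val pairs: RDD[(String, Long)] =
      sc.textFile("input.txt").flatMap(_.split("\\s+")).map(w => (w, 1L))

    // Before: groupByKey materializes every value for a key in one in-memory
    // buffer (a CompactBuffer) before summing, which fails for huge groups.
    val counts1 = pairs.groupByKey().mapValues(_.sum)

    // After: reduceByKey merges values pairwise map-side (a combiner), so no
    // group is ever held in memory in full.
    val counts2 = pairs.reduceByKey(_ + _)

Any aggregation you can express as an associative merge of two values rewrites this way; for more complex accumulators, aggregateByKey or combineByKey give you the same combiner behavior.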
On Mon, Sep 7, 2015 at 1:31 AM Sean Owen <so...@cloudera.com> wrote:

> That's how it's intended to work; if it's a problem, you probably need
> to re-design your computation to not use groupByKey. Usually you can
> do so.
>
> On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada <christoph.pi...@gmail.com> wrote:
> > Hi,
> >
> > I already posted this question on the users mailing list
> > (http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html)
> > but did not get a reply. Maybe this is the correct forum to ask.
> >
> > My problem is that doing groupByKey().mapToPair() loads all values for a
> > key into memory, which is a problem when the values don't fit into memory.
> > This was not a problem with Hadoop map/reduce, as the Iterable passed to
> > the reducer read from disk.
> >
> > In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer
> > containing all values.
> >
> > Is it possible to change this behavior without modifying Spark, or is
> > there a plan to change this?
> >
> > Thank you very much for your help!
> > Christoph.