Re: groupByKey() and keys with many values
Hi Antonio!

Thank you very much for your answer! You are right that in my case the computation could be replaced by a reduceByKey. The thing is that my computation also involves database queries:

1. Fetch key-specific data from the database into memory. This is expensive, and I only want to do it once per key.
2. Process each value using this data and update the common data.
3. Store the modified data back to the database. Here it is important to write all data for a key in one go.

Is there a pattern for implementing something like this with reduceByKey?

Out of curiosity: I understand why you want to discourage people from using groupByKey. But is there a technical reason why the Iterable is implemented the way it is?

Kind regards,
Christoph

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/groupByKey-and-keys-with-many-values-tp13985p13992.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
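[Editor's note: for readers landing on this thread, the three steps above match a pattern that is usually expressed in Spark with repartitionAndSortWithinPartitions followed by mapPartitions, so that each key's values arrive contiguously and can be streamed. The following is a minimal plain-Python sketch of that per-key streaming logic, not actual Spark code; `fetch_state`, `store_state`, and `process` are hypothetical placeholders for the database calls described in the email.]

```python
# Sketch of per-key streaming over (key, value) pairs sorted by key:
# one fetch when a key starts, one store when it ends, values
# processed one at a time without buffering them all in memory.
def process_partition(sorted_pairs, fetch_state, store_state, process):
    current_key = None
    state = None
    for key, value in sorted_pairs:
        if key != current_key:
            if current_key is not None:
                store_state(current_key, state)  # flush the previous key in one go
            current_key = key
            state = fetch_state(key)             # expensive read, done once per key
        state = process(state, value)            # stream values one by one
    if current_key is not None:
        store_state(current_key, state)          # flush the last key

# Toy in-memory "database" standing in for the real one.
db = {"a": 0, "b": 100}
writes = {}
pairs = sorted([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
process_partition(
    pairs,
    fetch_state=lambda k: db[k],
    store_state=lambda k, s: writes.__setitem__(k, s),
    process=lambda s, v: s + v,
)
# writes == {"a": 4, "b": 106}
```

In Spark, the same loop body would live inside the function passed to mapPartitions, with the sort guaranteed by repartitionAndSortWithinPartitions rather than by Python's sorted().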
groupByKey() and keys with many values
Hi,

I already posted this question on the users mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html) but did not get a reply. Maybe this is the correct forum to ask.

My problem is that doing groupByKey().mapToPair() loads all values for a key into memory, which is a problem when the values don't fit into memory. This was not a problem with Hadoop MapReduce, where the Iterable passed to the reducer reads from disk. In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer containing all the values.

Is it possible to change this behavior without modifying Spark, or is there a plan to change it?

Thank you very much for your help!
Christoph

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/groupByKey-and-keys-with-many-values-tp13985.html
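[Editor's note: the memory difference Christoph describes can be sketched in plain Python (this is not Spark code, just the per-key semantics): a groupByKey-style operation materializes every value for a key in one buffer, while a reduceByKey-style operation folds each value into a single accumulator as it arrives.]

```python
pairs = [("k", 1), ("k", 2), ("k", 3), ("j", 10)]

# groupByKey-style: the buffer for a key grows with the number of
# values, analogous to the CompactBuffer behind Spark's Iterable.
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

# reduceByKey-style: each value is merged immediately, so only one
# accumulator per key is ever held (and Spark can also combine
# map-side, before the shuffle).
summed = {}
for k, v in pairs:
    summed[k] = summed.get(k, 0) + v

# grouped == {"k": [1, 2, 3], "j": [10]}
# summed  == {"k": 6, "j": 10}
```

This is why reduceByKey (or aggregateByKey) is the usual recommendation when the per-key result is much smaller than the per-key value list, though it does not by itself cover the once-per-key database pattern discussed in the reply above.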