Oh, I’m sorry, I meant `aggregateByKey`.



https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
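
For example (a minimal sketch, assuming a pair RDD of (String, Int) and an existing SparkContext `sc`), this computes a per-key sum and count without materializing any per-key collections:

  // hypothetical pair RDD; any RDD[(K, V)] works the same way
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

  // aggregateByKey folds values into a per-key accumulator:
  //   the first function merges a value into the accumulator within a partition,
  //   the second merges accumulators across partitions
  val sumAndCount = pairs.aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),
    (a, b)   => (a._1 + b._1, a._2 + b._2)
  )

  sumAndCount.collect().foreach(println)  // e.g. (a,(4,2)), (b,(6,2))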



—
FG

On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Francois,
> RDD.aggregate() does not support aggregation by key. But, indeed, that is the 
> kind of implementation I am looking for, one that does not allocate 
> intermediate space for storing (K,V) pairs. When working with large datasets 
> this type of intermediate memory allocation wreaks havoc with garbage 
> collection, not to mention unnecessarily increasing the working memory 
> requirement of the program.
> I wonder if someone has already noticed this and there is an effort underway 
> to optimize this. If not, I will take a shot at adding this functionality.
> Mohit.
>> On Jan 27, 2015, at 1:52 PM, francois.garil...@typesafe.com wrote:
>> 
>> Have you looked at the `aggregate` function in the RDD API ? 
>> 
>> If your way of extracting the “key” (identifier) and “value” (payload) parts 
>> of the RDD elements is uniform (a function), it’s unclear to me how this 
>> would be more efficient than extracting the key and value and then using 
>> combine, however.
>> 
>> —
>> FG
>> 
>> 
>> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> 
>> Hi All, 
>> I have a use case with an RDD (not a (k, v) pair RDD) on which I want to do a 
>> combineByKey() operation. I can do that by creating an intermediate RDD of 
>> k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it 
>> will be more efficient if I can avoid this intermediate RDD. Is there a way 
>> I can do this by passing in a function that extracts the key, like in 
>> RDD.groupBy()? [oops, RDD.groupBy seems to create the intermediate RDD 
>> anyway, maybe a better implementation is possible for that too?] 
>> If not, is it worth adding to the Spark API? 
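>> 
>> For concreteness, what I do today looks roughly like this (a rough sketch; 
>> Record, the keying, and the combiners are just placeholders): 
>> 
>>   case class Record(key: String, value: Int) 
>>   val records = sc.parallelize(Seq(Record("a", 1), Record("a", 2), Record("b", 3))) 
>> 
>>   // intermediate RDD of (K, V) pairs, built only so combineByKey can be used 
>>   val keyed = records.map(r => (r.key, r)) 
>>   val combined = keyed.combineByKey( 
>>     (r: Record) => List(r),                        // createCombiner 
>>     (acc: List[Record], r: Record) => r :: acc,    // mergeValue 
>>     (a: List[Record], b: List[Record]) => a ::: b  // mergeCombiners 
>>   ) 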
>> 
>> Mohit. 
