Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
a mapPartitions or one of the other combineByKey APIs? From: Jianguo Li Date: Tuesday, June 23, 2015 at 9:46 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks. Yes, unfortunately, they all need to be grouped. I guess I can
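A minimal sketch of the mapPartitions route suggested here, assuming a hypothetical input: RDD[(String, String)] of (user_id, url) pairs (the variable names are illustrative, not from the thread):

import scala.collection.mutable

// Hypothetical input: RDD[(String, String)] of (user_id, url) pairs.
// Pre-group within each partition so that at most one record per key per
// partition is shuffled, instead of every raw pair.
val perPartition = input.mapPartitions { iter =>
  val buf = mutable.Map.empty[String, mutable.ListBuffer[String]]
  iter.foreach { case (user, url) =>
    buf.getOrElseUpdate(user, mutable.ListBuffer.empty[String]) += url
  }
  buf.iterator.map { case (user, urls) => (user, urls.toList) }
}

// Merge the per-partition lists across partitions.
val grouped = perPartition.reduceByKey(_ ++ _)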

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
? From: Jianguo Li Date: Monday, June 22, 2015 at 6:21 PM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learning Spark book: "We can disable map-side aggregation
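The map-side-aggregation remark corresponds to the combineByKey overload that takes a mapSideCombine flag. A sketch under the same assumption of an input: RDD[(String, String)] of (user_id, url) pairs; when the combiner only builds lists, combining does not shrink the data, so the flag can be turned off:

import org.apache.spark.HashPartitioner
import scala.collection.mutable.ListBuffer

// Hypothetical input: RDD[(String, String)] of (user_id, url) pairs.
val grouped = input.combineByKey(
  (url: String) => ListBuffer(url),                          // createCombiner
  (buf: ListBuffer[String], url: String) => buf += url,      // mergeValue
  (a: ListBuffer[String], b: ListBuffer[String]) => a ++= b, // mergeCombiners
  new HashPartitioner(input.partitions.length),
  mapSideCombine = false  // list-building gains nothing from a map-side combine
)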

workaround for groupByKey

2015-06-22 Thread Jianguo Li
Hi, I am processing an RDD of key-value pairs. The key is a user_id, and the value is a website url the user has visited. Since I need to know all the urls each user has visited, I am tempted to call groupByKey on this RDD. However, since there could be millions of users and urls,
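The situation described, as a small self-contained sketch (sc is an existing SparkContext; the data is made up for illustration):

// Made-up RDD of (user_id, url) pairs, one entry per visit.
val visits = sc.parallelize(Seq(
  ("u1", "a.com"), ("u1", "b.com"), ("u2", "a.com")
))

// The tempting call: every raw pair is shuffled, and all urls for one
// user must fit in a single in-memory collection on one executor.
val urlsPerUser = visits.groupByKey()  // RDD[(String, Iterable[String])]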

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
There is reduceByKey, which works on (K, V). You need to accumulate partial results and proceed. Does your computation allow that? On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am processing an RDD of key-value pairs. The key is a user_id, and the value is
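One way to read the reduceByKey suggestion, sketched over the same hypothetical input: RDD[(String, String)] of (user_id, url) pairs: wrap each value in a one-element collection, then merge collections per key, so partial results are combined map-side before the shuffle. Note that a Set also de-duplicates repeated visits; a List would keep duplicates.

// Hypothetical input: RDD[(String, String)] of (user_id, url) pairs.
val urlsPerUser = input
  .mapValues(url => Set(url))   // one-element Set per visit
  .reduceByKey(_ ++ _)          // merge Sets per user_id, map-side first
// urlsPerUser: RDD[(String, Set[String])]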

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b) From: Jianguo Li Date: Monday, June 22, 2015 at 5:12 PM To: user@spark.apache.org Subject: workaround for groupByKey Hi, I am processing an RDD of key-value pairs. The key is a user_id
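Spelled out as a self-contained sketch (again assuming a hypothetical input: RDD[(String, String)] of (user_id, url) pairs):

import scala.collection.mutable.ListBuffer

// Zero value: an empty buffer per key. seqOp appends one url within a
// partition; combOp concatenates partial buffers from different partitions.
val test = input.aggregateByKey(ListBuffer.empty[String])(
  (buf, url) => buf += url,  // seqOp
  (a, b) => a ++ b           // combOp
)
// test: RDD[(String, ListBuffer[String])]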

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
perhaps? From: Jianguo Li Date: Monday, June 22, 2015 at 6:21 PM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learning Spark book: "We can