Silvio,

Suppose my RDD is (K-1, v1, v2, v3, v4).
If I want to do simple addition, I can use reduceByKey or aggregateByKey.
What if my processing needs to check all the items in the value list each
time? The above two operations never see all the values at once; they just get
two values at a time (v1, v2), you do some processing and store the result back into v1.

How do I use the combiner facility present with reduceByKey and
aggregateByKey?
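[Editor's note: a minimal sketch of the semantics in question, in plain Scala without a Spark cluster. The seqOp/combOp pair below mirrors what aggregateByKey does: seqOp folds each value into a per-partition buffer (the map-side combine), and combOp merges the buffers after the shuffle, so the final result holds all values per key, not just pairwise reductions. The partition contents are hypothetical.]

```scala
import scala.collection.mutable.ListBuffer

object CombineSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical values for one key, as they might arrive in two partitions.
    val part1 = Seq("v1", "v2")
    val part2 = Seq("v3", "v4")

    // seqOp: fold one value into the per-partition accumulator.
    val seqOp: (ListBuffer[String], String) => ListBuffer[String] =
      (buf, v) => buf += v

    // combOp: merge two per-partition accumulators into one.
    val combOp: (ListBuffer[String], ListBuffer[String]) => ListBuffer[String] =
      (a, b) => a ++ b

    // Simulate the map-side combine on each partition, then the merge.
    val buf1   = part1.foldLeft(ListBuffer.empty[String])(seqOp)
    val buf2   = part2.foldLeft(ListBuffer.empty[String])(seqOp)
    val merged = combOp(buf1, buf2)

    println(merged.toList) // all four values survive: List(v1, v2, v3, v4)
  }
}
```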

-deepak

On Mon, Jun 22, 2015 at 2:43 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>  You can use aggregateByKey as one option:
>
>  import scala.collection.mutable.ListBuffer
>
>  val input: RDD[(Int, String)] = ...
>
>  val test = input.aggregateByKey(ListBuffer.empty[String])(
>    (a, b) => a += b,  // add a value to the per-partition buffer
>    (a, b) => a ++ b)  // merge buffers across partitions
>
>   From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "user@spark.apache.org"
> Subject: workaround for groupByKey
>
>   Hi,
>
>  I am processing an RDD of key-value pairs. The key is a user_id, and
> the value is a website URL the user has visited.
>
>  Since I need to know all the URLs each user has visited, I am tempted
> to call groupByKey on this RDD. However, since there could be millions
> of users and URLs, the shuffling caused by groupByKey proves to be a major
> bottleneck to getting the job done. Is there any workaround? I want to end up
> with an RDD of key-value pairs, where the key is a user_id and the value is a
> list of all the URLs visited by that user.
>
>  Thanks,
>
>  Jianguo
>



-- 
Deepak
