Do you mean to first union all RDDs together and then do a reduceByKey()? Suppose my unioned RDD is

rdd : (“id1”, “text 1”), (“id1”, “text 2”), (“id1”, “text 3”)

How can I use reduceByKey() to return (“id1”, “text 3”), i.e. to take the last entry for each key? A code snippet is appreciated because I am new to Spark.

Ningjun

From: Boromir Widas [mailto:[email protected]]
Sent: Friday, February 13, 2015 1:28 PM
To: Wang, Ningjun (LNG-NPV)
Cc: [email protected]
Subject: Re: How to union RDD and remove duplicated keys

reduceByKey should work, but you need to define the ordering by using some sort of index.

On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <[email protected]> wrote:

I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.

rdd1: (“id1”, “Long text 1”), (“id2”, “Long text 2”), (“id3”, “Long text 3”)
rdd2: (“id1”, “Long text 1 A”), (“id2”, “Long text 2 A”)
rdd3: (“id1”, “Long text 1 B”)

Then I want to merge all the RDDs. If there are duplicate docIds, a later RDD should overwrite the earlier ones. In the above case, rdd2 overwrites rdd1 for “id1” and “id2”, then rdd3 overwrites rdd2 for “id1”. The final merged RDD should be

rddFinal: (“id1”, “Long text 1 B”), (“id2”, “Long text 2 A”), (“id3”, “Long text 3”)

Note that I have many such RDDs and each RDD has many elements. How can I do this efficiently?

Ningjun
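The index-based approach Boromir suggests can be sketched as follows. This is a minimal illustration in plain Python (no Spark needed to follow the logic); the corresponding Spark operations are noted in the comments. The function name `merge_keep_last` and the local-list stand-ins for RDDs are hypothetical, purely for illustration.

```python
def merge_keep_last(rdds):
    """Merge a list of (key, value) collections; for duplicate keys,
    the value from the latest collection in the list wins.

    In Spark this corresponds to:
      1. tagging: rdd_i.map { case (k, v) => (k, (i, v)) }
      2. merging: sc.union(taggedRdds)
      3. reducing: .reduceByKey((a, b) => if (a._1 > b._1) a else b)
      4. dropping the tag: .mapValues(_._2)
    """
    # Step 1 + 2: tag each pair with the index of its source collection,
    # then concatenate everything (the analogue of union).
    tagged = [(k, (i, v)) for i, rdd in enumerate(rdds) for k, v in rdd]

    # Step 3: per key, keep the (index, value) pair with the largest index,
    # exactly what the reduceByKey function above would do.
    merged = {}
    for k, (i, v) in tagged:
        if k not in merged or i > merged[k][0]:
            merged[k] = (i, v)

    # Step 4: drop the index tag.
    return {k: v for k, (i, v) in merged.items()}


rdd1 = [("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")]
rdd2 = [("id1", "Long text 1 A"), ("id2", "Long text 2 A")]
rdd3 = [("id1", "Long text 1 B")]

print(merge_keep_last([rdd1, rdd2, rdd3]))
# {'id1': 'Long text 1 B', 'id2': 'Long text 2 A', 'id3': 'Long text 3'}
```

The key point is that the reduce function passed to reduceByKey must be commutative and associative, so you cannot rely on "the last one seen"; attaching an explicit index per source RDD makes "latest wins" a well-defined max over indices.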
