Do you mean to first union all RDDs together and then do a reduceByKey()? Suppose my unioned RDD is

rdd : (“id1”, “text 1”), (“id1”, “text 2”), (“id1”, “text 3”)

How can I use reduceByKey() to return (“id1”, “text 3”), i.e. to take the last entry for each key? A code snippet is appreciated because I am new to Spark.

Ningjun

From: Boromir Widas [mailto:[email protected]]
Sent: Friday, February 13, 2015 1:28 PM
To: Wang, Ningjun (LNG-NPV)
Cc: [email protected]
Subject: Re: How to union RDD and remove duplicated keys

reduceByKey should work, but you need to define the ordering by using some sort of index.

On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <[email protected]> wrote:

I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.

rdd1: (“id1”, “Long text 1”), (“id2”, “Long text 2”), (“id3”, “Long text 3”)
rdd2: (“id1”, “Long text 1 A”), (“id2”, “Long text 2 A”)
rdd3: (“id1”, “Long text 1 B”)

Then I want to merge all the RDDs. If there are duplicate docIds, a later RDD should overwrite the earlier ones. In the above case, rdd2 overwrites rdd1 for “id1” and “id2”, then rdd3 overwrites rdd2 for “id1”. The final merged RDD should be

rddFinal: (“id1”, “Long text 1 B”), (“id2”, “Long text 2 A”), (“id3”, “Long text 3”)

Note that I have many such RDDs and each RDD has many elements. How can I do this efficiently?

Ningjun
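The index-based approach Boromir suggests can be sketched as follows. This is a minimal illustration in plain Python (no Spark needed to follow the logic); the corresponding Spark operations are noted in the comments. The function name `merge_keep_last` and the local-list stand-ins for RDDs are hypothetical, purely for illustration.

```python
def merge_keep_last(rdds):
    """Merge a list of (key, value) collections; for duplicate keys,
    the value from the latest collection in the list wins.

    In Spark this corresponds to:
      1. tagging: rdd_i.map { case (k, v) => (k, (i, v)) }
      2. merging: sc.union(taggedRdds)
      3. reducing: .reduceByKey((a, b) => if (a._1 > b._1) a else b)
      4. dropping the tag: .mapValues(_._2)
    """
    # Step 1 + 2: tag each pair with the index of its source collection,
    # then concatenate everything (the analogue of union).
    tagged = [(k, (i, v)) for i, rdd in enumerate(rdds) for k, v in rdd]

    # Step 3: per key, keep the (index, value) pair with the largest index,
    # exactly what the reduceByKey function above would do.
    merged = {}
    for k, (i, v) in tagged:
        if k not in merged or i > merged[k][0]:
            merged[k] = (i, v)

    # Step 4: drop the index tag.
    return {k: v for k, (i, v) in merged.items()}


rdd1 = [("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")]
rdd2 = [("id1", "Long text 1 A"), ("id2", "Long text 2 A")]
rdd3 = [("id1", "Long text 1 B")]

print(merge_keep_last([rdd1, rdd2, rdd3]))
# {'id1': 'Long text 1 B', 'id2': 'Long text 2 A', 'id3': 'Long text 3'}
```

The key point is that the reduce function passed to reduceByKey must be commutative and associative, so you cannot rely on "the last one seen"; attaching an explicit index per source RDD makes "latest wins" a well-defined max over indices.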
