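reduceByKey should work, but you need to define the ordering yourself, since it makes no guarantee about the order in which values are combined. Tag each record with some sort of index (for example, the position of its source RDD) and have the reduce function keep the value with the highest index.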
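Here is a minimal sketch of that idea, not a tested production solution. The names `MergeRdds` and `mergeLatestWins` are made up for illustration, and it assumes the RDDs are passed in oldest to newest, so the position in the sequence can serve as the index:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on Spark versions before 1.3
import org.apache.spark.rdd.RDD

object MergeRdds {
  // Hypothetical helper: for a duplicated docId, the value from the RDD
  // that appears later in `rdds` wins.
  def mergeLatestWins(
      sc: SparkContext,
      rdds: Seq[RDD[(String, String)]]): RDD[(String, String)] = {
    // Tag every value with the position of the RDD it came from.
    val tagged = rdds.zipWithIndex.map { case (rdd, idx) =>
      rdd.mapValues(text => (idx, text))
    }
    sc.union(tagged)
      // Keep the value carrying the larger index. This function is
      // associative and commutative, as reduceByKey requires.
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
      // Drop the index again.
      .mapValues(_._2)
  }
}
```

Calling `MergeRdds.mergeLatestWins(sc, Seq(rdd1, rdd2, rdd3))` on the three RDDs from the question should give the pairs listed as rddFinal, though not necessarily in that order. Because reduceByKey combines values on the map side before shuffling, at most one text per docId per partition goes over the network, which should help when there are many large RDDs.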
On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:
> I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.
>
> rdd1: ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")
> rdd2: ("id1", "Long text 1 A"), ("id2", "Long text 2 A")
> rdd3: ("id1", "Long text 1 B")
>
> Then, I want to merge all RDDs. If there are duplicate docIds, a later RDD
> should overwrite an earlier one. In the above case, rdd2 will overwrite rdd1
> for "id1" and "id2", then rdd3 will overwrite rdd2 for "id1". The final
> merged RDD should be
>
> rddFinal: ("id1", "Long text 1 B"), ("id2", "Long text 2 A"), ("id3", "Long text 3")
>
> Note that I have many such RDDs and each RDD has lots of elements. How
> can I do it efficiently?
>
> Ningjun