I have multiple RDD[(String, String)] instances that store (docId, docText) pairs, e.g.

rdd1:   ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")
rdd2:   ("id1", "Long text 1 A"), ("id2", "Long text 2 A")
rdd3:   ("id1", "Long text 1 B")

Then I want to merge all the RDDs. If there are duplicate docIds, a later RDD
should overwrite an earlier one. In the above case, rdd2 will overwrite rdd1 for
"id1" and "id2", then rdd3 will overwrite rdd2 for "id1". The final merged RDD
should be

rddFinal: ("id1", "Long text 1 B"), ("id2", "Long text 2 A"), ("id3", "Long text 3")
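
To make the intended semantics concrete, here is a minimal sketch using plain Scala collections as stand-ins for the RDDs. It is not a Spark solution, and the names MergeSketch and mergeLatest are made up for illustration. It also assumes a docId appears at most once within a single source. I would guess the same shape (tag each pair with the position of its source, union everything, keep the highest position per key) could map onto union plus reduceByKey on real RDDs, but I have not measured that.

```scala
object MergeSketch {
  // Tag every (docId, docText) pair with the position of its source, then
  // keep, per docId, the text that came from the highest position.
  // Assumption: a docId occurs at most once within one source.
  def mergeLatest(sources: Seq[Seq[(String, String)]]): Map[String, String] =
    sources.zipWithIndex
      .flatMap { case (docs, idx) =>
        docs.map { case (id, text) => (id, (idx, text)) }
      }
      .groupBy(_._1)
      .map { case (id, tagged) =>
        // tagged holds (id, (idx, text)); pick the entry with the largest idx
        id -> tagged.map(_._2).maxBy(_._1)._2
      }

  def main(args: Array[String]): Unit = {
    // Local stand-ins for rdd1, rdd2, rdd3 from the example above
    val rdd1 = Seq("id1" -> "Long text 1", "id2" -> "Long text 2", "id3" -> "Long text 3")
    val rdd2 = Seq("id1" -> "Long text 1 A", "id2" -> "Long text 2 A")
    val rdd3 = Seq("id1" -> "Long text 1 B")

    // Order matters: later sources win
    mergeLatest(Seq(rdd1, rdd2, rdd3)).toSeq.sortBy(_._1).foreach(println)
  }
}
```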

Note that I have many such RDDs and each RDD has a large number of elements.
How can I do this efficiently?


Ningjun
