I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.

rdd1: ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")
rdd2: ("id1", "Long text 1 A"), ("id2", "Long text 2 A")
rdd3: ("id1", "Long text 1 B")

Then I want to merge all the RDDs. If there are duplicate docIds, a later RDD should overwrite an earlier one. In the above case, rdd2 will overwrite rdd1 for "id1" and "id2", then rdd3 will overwrite rdd2 for "id1". The final merged RDD should be

rddFinal: ("id1", "Long text 1 B"), ("id2", "Long text 2 A"), ("id3", "Long text 3")

Note that I have many such RDDs and each RDD has a lot of elements. How can I do this efficiently?

Ningjun
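
To make the intended "later RDD wins" semantics concrete, here is a minimal sketch of one possible approach, not necessarily the most efficient one: tag each record with the position of its source RDD, union everything once, and keep the record with the highest position per docId. The names MergeRdds and mergeLatestWins are hypothetical, and the sketch assumes the input sequence is non-empty and ordered from oldest to newest. On older Spark versions the pair-RDD methods may also need `import org.apache.spark.SparkContext._`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object MergeRdds {

  // Assumption: rdds is non-empty and ordered oldest first,
  // so a higher index means higher priority on duplicate docIds.
  def mergeLatestWins(
      sc: SparkContext,
      rdds: Seq[RDD[(String, String)]]): RDD[(String, String)] = {

    // Tag every (docId, docText) with the index of the RDD it came from.
    val tagged: Seq[RDD[(String, (Int, String))]] =
      rdds.zipWithIndex.map { case (rdd, idx) =>
        rdd.mapValues(text => (idx, text))
      }

    // Union all inputs in one step instead of chaining pairwise unions.
    val all: RDD[(String, (Int, String))] = sc.union(tagged)

    // For each docId keep the text from the highest-indexed RDD,
    // then drop the index again.
    all
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
      .mapValues(_._2)
  }
}
```

With the example above, `MergeRdds.mergeLatestWins(sc, Seq(rdd1, rdd2, rdd3))` should contain the same three pairs as rddFinal, although the order of elements is not guaranteed. The whole merge goes through a single reduceByKey, which combines values on the map side before shuffling, so it is probably cheaper than overwriting the RDDs pair by pair, but I have not measured it.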