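reduceByKey should work, but you need to define the ordering yourself, e.g.
by tagging each record with the index of the RDD it came from. reduceByKey
does not guarantee the order in which values for a key are combined, so the
reduce function has to pick the winner based on that index.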
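
Something along these lines might do it. This is an untested sketch, not a
definitive implementation: the names MergeByIndex, keepLatest and mergeLatest
are made up for illustration, and it assumes you can put your RDDs in a Seq
ordered oldest first, so that the position in the Seq serves as the index.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3 you would also need
// `import org.apache.spark.SparkContext._` for the pair-RDD operations.

object MergeByIndex {
  // Of two (index, text) candidates for the same docId, keep the one
  // that came from the later RDD (the higher index).
  def keepLatest(a: (Int, String), b: (Int, String)): (Int, String) =
    if (a._1 >= b._1) a else b

  // Assumption: `rdds` is ordered oldest first.
  def mergeLatest(sc: SparkContext,
                  rdds: Seq[RDD[(String, String)]]): RDD[(String, String)] = {
    // Tag every record with the index of the RDD it came from.
    val tagged = rdds.zipWithIndex.map { case (rdd, idx) =>
      rdd.mapValues(text => (idx, text))
    }
    sc.union(tagged)                              // one RDD, no shuffle yet
      .reduceByKey((a, b) => keepLatest(a, b))    // single shuffle, latest wins
      .mapValues(_._2)                            // drop the index again
  }
}
```

With your example you would call MergeByIndex.mergeLatest(sc, Seq(rdd1, rdd2,
rdd3)). Since reduceByKey combines on the map side, each partition should
ship at most one record per docId into the shuffle. If the same docId can
occur more than once inside a single RDD, the indexes tie and which text
survives is arbitrary.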

On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>
>
> I have multiple RDD[(String, String)] that store (docId, docText) pairs,
> e.g.
>
>
>
> rdd1:   (“id1”, “Long text 1”), (“id2”, “Long text 2”), (“id3”, “Long text
> 3”)
>
> rdd2:   (“id1”, “Long text 1 A”), (“id2”, “Long text 2 A”)
>
> rdd3:   (“id1”, “Long text 1 B”)
>
>
>
> Then, I want to merge all RDDs. If there are duplicate docIds, a later RDD
> should overwrite an earlier one. In the above case, rdd2 will overwrite rdd1
> for “id1” and “id2”, then rdd3 will overwrite rdd2 for “id1”. The final
> merged rdd should be
>
>
>
> rddFinal: (“id1”, “Long text 1 B”), (“id2”, “Long text 2 A”), (“id3”,
> “Long text 3”)
>
>
>
> Note that I have many such RDDs and each RDD has lots of elements. How
> can I do it efficiently?
>
>
>
>
>
> Ningjun
>
>
>
