I have not run the following, but it should be along these lines:

rdd.zipWithIndex()                                      // pair each record with its position in the unioned RDD
   .map { case ((id, text), idx) => (id, (text, idx)) } // key by id, carry (text, index)
   .reduceByKey((a, b) => if (a._2 > b._2) a else b)    // keep the entry with the larger index, i.e. the later one
   .map { case (id, (text, _)) => (id, text) }          // drop the index
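
For reference, here is an untested end-to-end sketch of the same idea, using the
three sample RDDs from your earlier mail and assuming an existing SparkContext
named sc (the variable names are only for illustration):

val rdd1 = sc.parallelize(Seq(("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")))
val rdd2 = sc.parallelize(Seq(("id1", "Long text 1 A"), ("id2", "Long text 2 A")))
val rdd3 = sc.parallelize(Seq(("id1", "Long text 1 B")))

// Union in order, then use the global index as a tie-breaker so that,
// for each id, the entry coming from the latest RDD wins.
val merged = (rdd1 ++ rdd2 ++ rdd3)
  .zipWithIndex()
  .map { case ((id, text), idx) => (id, (text, idx)) }
  .reduceByKey((a, b) => if (a._2 > b._2) a else b)
  .map { case (id, (text, _)) => (id, text) }

// merged.collect() should contain, in some order:
// (id1, Long text 1 B), (id2, Long text 2 A), (id3, Long text 3)

This relies on union() keeping the partitions of its inputs in order, so records
from later RDDs always get larger indices from zipWithIndex().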

On Fri, Feb 13, 2015 at 3:27 PM, Wang, Ningjun (LNG-NPV) <
[email protected]> wrote:

>  Do you mean to first union all the RDDs together and then do a reduceByKey()?
> Suppose my unioned RDD is
>
>
>
> rdd :  (“id1”, “text 1”),  (“id1”, “text 2”), (“id1”, “text 3”)
>
> How can I use reduceByKey to return (“id1”, “text 3”)? That is, I want to
> take the last entry for each key.
>
> Code snippet is appreciated because I am new to Spark.
>
> Ningjun
>
>
>
> *From:* Boromir Widas [mailto:[email protected]]
> *Sent:* Friday, February 13, 2015 1:28 PM
> *To:* Wang, Ningjun (LNG-NPV)
> *Cc:* [email protected]
> *Subject:* Re: How to union RDD and remove duplicated keys
>
>
>
> reduceByKey should work, but you need to define the ordering by using some
> sort of index.
>
>
>
> On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <
> [email protected]> wrote:
>
>
>
> I have multiple RDD[(String, String)] that store (docId, docText) pairs,
> e.g.
>
>
>
> rdd1:   (“id1”, “Long text 1”), (“id2”, “Long text 2”), (“id3”, “Long text
> 3”)
>
> rdd2:   (“id1”, “Long text 1 A”), (“id2”, “Long text 2 A”)
>
> rdd3:   (“id1”, “Long text 1 B”)
>
>
>
> Then, I want to merge all RDDs. If there are duplicate docIds, later RDDs
> should overwrite earlier ones. In the above case, rdd2 will overwrite rdd1
> for “id1” and “id2”, then rdd3 will overwrite rdd2 for “id1”. The final
> merged RDD should be
>
>
>
> rddFinal: (“id1”, “Long text 1 B”), (“id2”, “Long text 2 A”), (“id3”,
> “Long text 3”)
>
>
>
> Note that I have many such RDDs and each RDD has lots of elements. How
> can I do this efficiently?
>
>
>
>
>
> Ningjun
>
>
>
>
>
