I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.
rdd1: (id1, Long text 1), (id2, Long text 2), (id3, Long text 3)
rdd2: (id1, Long text 1 A), (id2, Long text 2 A)
rdd3: (id1, Long text 1 B)
Then, I want to merge all RDDs. If there are duplicated docIds, later
reduceByKey should work, but you need to define the ordering by using some
sort of index.
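
A rough sketch of that suggestion, in case it helps. It assumes the intent is that the text from a later RDD should win when a docId appears more than once (the question is cut off at that point, so this is a guess), and it uses each RDD's position in the input list as the "index". The names MergeDocs and mergeKeepLatest are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on Spark 1.2 and earlier
import org.apache.spark.rdd.RDD

object MergeDocs {
  /** Union the RDDs; for each docId keep the text from the RDD that
    * appears latest in `rdds` (assumption: "later wins").
    */
  def mergeKeepLatest(sc: SparkContext,
                      rdds: Seq[RDD[(String, String)]]): RDD[(String, String)] = {
    // Tag every record with the position of its source RDD.
    // A higher index means the record came from a later RDD.
    val tagged: Seq[RDD[(String, (Int, String))]] =
      rdds.zipWithIndex.map { case (rdd, idx) =>
        rdd.mapValues(text => (idx, text))
      }

    sc.union(tagged)
      // For duplicated docIds, keep the value carrying the larger index.
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
      // Drop the index again so the result has the original shape.
      .mapValues(_._2)
  }
}
```

With the three example RDDs from the question, MergeDocs.mergeKeepLatest(sc, Seq(rdd1, rdd2, rdd3)) should give (id1, Long text 1 B), (id2, Long text 2 A), (id3, Long text 3), in no guaranteed order. The index is needed because reduceByKey makes no promise about the order in which values for a key are combined, so "the last one seen" is not reliable on its own.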
On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have multiple RDD[(String, String)] that store (docId, docText) pairs,
e.g.
rdd1: (“id1”, “Long text
is appreciated because I am new to Spark.
Ningjun
From: Boromir Widas [mailto:vcsub...@gmail.com]
Sent: Friday, February 13, 2015 1:28 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to union RDD and remove duplicated keys