One way, though not especially efficient, could be this:
val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))  // (stringKey, row)
val r2 = r1.keys.distinct().zipWithIndex()                // (stringKey, numericId)
val r3 = r2.join(r1).values                               // (numericId, row)
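
For reference, here is a self-contained sketch of the same idea (the OriginalRow case class, the sample data, and extractKeyFromOriginalRow are placeholders I made up, not from your code). Since zipWithIndex numbers from 0, adding 1 to the index gives ids running from 1 to N as you asked:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ConsecutiveIds {
  // hypothetical row type standing in for the rows loaded from the database
  case class OriginalRow(key: String, payload: String)

  def extractKeyFromOriginalRow(row: OriginalRow): String = row.key

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("consecutive-ids").setMaster("local[*]"))

    val r0: RDD[OriginalRow] = sc.parallelize(Seq(
      OriginalRow("a", "row1"), OriginalRow("b", "row2"), OriginalRow("a", "row3")))

    // pair every row with its string key
    val r1: RDD[(String, OriginalRow)] = r0.keyBy(row => extractKeyFromOriginalRow(row))

    // one (stringKey, numericId) pair per distinct key; +1 makes the ids run 1..N
    val r2: RDD[(String, Long)] = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

    // join the ids back onto the rows and keep (numericId, row)
    val r3: RDD[(Long, OriginalRow)] = r2.join(r1).values

    r3.collect().foreach(println)  // e.g. (1,OriginalRow(a,row1)), (2,OriginalRow(b,row2)), ...
    sc.stop()
  }
}

Note that which key gets which id depends on how zipWithIndex orders the distinct keys across partitions, so the ids are consecutive but not tied to any particular key ordering.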
On 11/18/14 8:54 PM, shahab wrote:
Hi,
In my Spark application, I am loading some rows from a database into
Spark RDDs.
Each row has several fields and a string key. Due to my requirements,
I need to work with consecutive numeric ids (starting from 1 to N,
where N is the number of unique keys) instead of string keys. Also,
several rows can have the same string key.
In the Spark context, how can I map each row into (Numeric_Key,
OriginalRow) with map/reduce tasks such that rows with the same original
string key get the same consecutive numeric key?
Any hints?
best,
/Shahab