zipWithIndex is zero-based, so if you want 1 to N you'll need to increment the indices, like so:
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

If the number of distinct keys is relatively small, you might consider collecting them into a map and broadcasting it rather than using a join, like so:

val keyIndices = sc.broadcast(r2.collect.toMap)
// a broadcast variable must be dereferenced with .value on the workers
val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> A not-so-efficient way would be this:
>
> val r0: RDD[OriginalRow] = ...
> val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
> val r2 = r1.keys.distinct().zipWithIndex()
> val r3 = r2.join(r1).values
>
> On 11/18/14 8:54 PM, shahab wrote:
>
> Hi,
>
> In my Spark application, I am loading some rows from a database into
> Spark RDDs. Each row has several fields and a string key. Due to my
> requirements, I need to work with consecutive numeric ids (starting
> from 1 to N, where N is the number of unique keys) instead of string
> keys. Also, several rows can have the same string key.
>
> In Spark, how can I map each row into (Numeric_Key, OriginalRow) with
> map/reduce tasks such that rows with the same original string key get
> the same consecutive numeric key?
>
> Any hints?
>
> best,
> /Shahab

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io
W: www.velos.io
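To tie the two suggestions together, here is a minimal end-to-end sketch of the zipWithIndex-plus-broadcast approach. The Row case class, its fields, and the sample data below are hypothetical stand-ins for the actual schema, which the thread does not show:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical row type standing in for the real database rows.
case class Row(key: String, payload: String)

object ConsecutiveIds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("consecutive-ids").setMaster("local[*]"))

    // Sample input; several rows share the same string key.
    val rows: RDD[Row] = sc.parallelize(Seq(
      Row("a", "x1"), Row("b", "x2"), Row("a", "x3"), Row("c", "x4")))

    // Key each row by its string key.
    val keyed = rows.keyBy(_.key)

    // Assign each distinct key an index; zipWithIndex is zero-based,
    // so add 1 to get ids in 1..N.
    val keyIds = keyed.keys.distinct().zipWithIndex().mapValues(_ + 1)

    // Small key space: collect to a map and broadcast instead of joining.
    val idByKey = sc.broadcast(keyIds.collectAsMap())

    // Rows with the same string key receive the same numeric id.
    val result: RDD[(Long, Row)] =
      keyed.map { case (k, row) => (idByKey.value(k), row) }

    result.collect().foreach(println)
    sc.stop()
  }
}

One caveat: the ids are consecutive 1 to N, but zipWithIndex assigns them by partition and in-partition order, so the particular id a given key receives may not be stable across runs.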