Re: How to assign consecutive numeric id to each row based on its content?

2014-11-25 Thread shahab
Thanks a lot, both solutions work.

best,
/Shahab

On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
 increment them like so:

 val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

 If the number of distinct keys is relatively small, you might consider
 collecting them into a map and broadcasting them rather than using a join,
 like so:

 val keyIndices = sc.broadcast(r2.collect.toMap)
 val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

 On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian lian.cs@gmail.com wrote:

  A not-so-efficient way would be this:

 val r0: RDD[OriginalRow] = ...
 val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
 val r2 = r1.keys.distinct().zipWithIndex()
 val r3 = r2.join(r1).values

 On 11/18/14 8:54 PM, shahab wrote:

   Hi,

  In my Spark application, I am loading some rows from a database into
 Spark RDDs.
 Each row has several fields and a string key. Due to my requirements, I
 need to work with consecutive numeric ids (from 1 to N, where N is the
 number of unique keys) instead of string keys. Also, several rows can
 have the same string key.

  In the Spark context, how can I map each row into (Numeric_Key,
 OriginalRow) using map/reduce tasks such that rows with the same original
 string key get the same consecutive numeric key?

  Any hints?

  best,
 /Shahab





 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
Hi,

In my Spark application, I am loading some rows from a database into Spark
RDDs.
Each row has several fields and a string key. Due to my requirements, I
need to work with consecutive numeric ids (from 1 to N, where N is the
number of unique keys) instead of string keys. Also, several rows can
have the same string key.

In the Spark context, how can I map each row into (Numeric_Key, OriginalRow)
using map/reduce tasks such that rows with the same original string key get
the same consecutive numeric key?

Any hints?

best,
/Shahab


Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian

A not-so-efficient way would be this:

val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
val r2 = r1.keys.distinct().zipWithIndex()
val r3 = r2.join(r1).values
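
For instance, assuming simple (String, Int) rows where the first field is
the string key (hypothetical stand-ins for OriginalRow and
extractKeyFromOriginalRow, with a SparkContext sc), a minimal sketch of
this pipeline might look like:

val r0 = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val r1 = r0.keyBy(row => row._1)            // (stringKey, originalRow)
val r2 = r1.keys.distinct().zipWithIndex()  // (stringKey, numericId)
val r3 = r2.join(r1).values                 // (numericId, originalRow)

Here r3.collect() yields pairs such as (0,("a",1)), (1,("b",2)), (0,("a",3)):
both "a" rows share one id, though the ids are zero-based and which key gets
which id depends on partition order.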

On 11/18/14 8:54 PM, shahab wrote:


Hi,

In my Spark application, I am loading some rows from a database into
Spark RDDs.
Each row has several fields and a string key. Due to my requirements,
I need to work with consecutive numeric ids (from 1 to N, where N is
the number of unique keys) instead of string keys. Also, several rows
can have the same string key.


In the Spark context, how can I map each row into (Numeric_Key,
OriginalRow) using map/reduce tasks such that rows with the same
original string key get the same consecutive numeric key?


Any hints?

best,
/Shahab




Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
increment them like so:

val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)
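
To see the shift on a toy key set (hypothetical keys, assuming a
SparkContext sc):

val keys = sc.parallelize(Seq("a", "b", "c"))
// zipWithIndex assigns 0, 1, 2, ... in partition order;
// mapValues(_ + 1) shifts the range to 1..N.
val ids = keys.distinct().zipWithIndex().mapValues(_ + 1)
// ids.collect() might give Array(("a",1), ("b",2), ("c",3)),
// though which key gets which id depends on partitioning.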

If the number of distinct keys is relatively small, you might consider
collecting them into a map and broadcasting them rather than using a join,
like so:

val keyIndices = sc.broadcast(r2.collect.toMap)
val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }
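
Putting the two pieces together on hypothetical sample data (assuming a
SparkContext sc; note that collect() pulls every distinct key to the
driver, so this only works when the key set fits in memory):

val r1 = sc.parallelize(Seq(("a", "row1"), ("b", "row2"), ("a", "row3")))
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)
val keyIndices = sc.broadcast(r2.collect().toMap)  // Map[String, Long]
val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }
// r3 holds (numericId, row) pairs, e.g. (1,"row1"), (2,"row2"), (1,"row3");
// both "a" rows get the same id.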

On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian lian.cs@gmail.com wrote:

  A not-so-efficient way would be this:

 val r0: RDD[OriginalRow] = ...
 val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
 val r2 = r1.keys.distinct().zipWithIndex()
 val r3 = r2.join(r1).values

 On 11/18/14 8:54 PM, shahab wrote:

   Hi,

  In my Spark application, I am loading some rows from a database into Spark
 RDDs.
 Each row has several fields and a string key. Due to my requirements, I
 need to work with consecutive numeric ids (from 1 to N, where N is the
 number of unique keys) instead of string keys. Also, several rows can
 have the same string key.

  In the Spark context, how can I map each row into (Numeric_Key, OriginalRow)
 using map/reduce tasks such that rows with the same original string key get
 the same consecutive numeric key?

  Any hints?

  best,
 /Shahab





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io