Re: How to assign consecutive numeric id to each row based on its content?

2014-11-25 Thread shahab
Thanks a lot, both solutions work.

Best,
/Shahab

On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to increment them like so:

    val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
Hi, in my Spark application I am loading some rows from a database into Spark RDDs. Each row has several fields and a string key. Due to my requirements, I need to work with consecutive numeric ids (from 1 to N, where N is the number of unique keys) instead of string keys. Also several

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian
A not-so-efficient way is this:

    val r0: RDD[OriginalRow] = ...
    val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
    val r2 = r1.keys.distinct().zipWithIndex()
    val r3 = r2.join(r1).values

On 11/18/14 8:54 PM, shahab wrote: Hi, In my spark application, I am loading some
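
For readers who want to try the join-based approach above, here is a minimal, self-contained sketch. It assumes Spark 1.3 or later (so the pair-RDD implicits resolve without an extra import) and a local master. The `OriginalRow` fields, the body of `extractKeyFromOriginalRow`, and the `JoinIdsSketch` object are hypothetical stand-ins, since the thread does not show the real row type. Ids here are zero-based, and which key receives which id is not guaranteed to be stable across runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical row type; the real schema is not shown in the thread.
case class OriginalRow(key: String, payload: String)

object JoinIdsSketch {
  // Hypothetical stand-in for the key extractor named in the thread.
  def extractKeyFromOriginalRow(row: OriginalRow): String = row.key

  // Pairs every row with the zero-based id assigned to its key.
  def withIds(r0: RDD[OriginalRow]): RDD[(Long, OriginalRow)] = {
    // (key, row) pairs
    val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
    // (key, id) pairs, one per distinct key, ids 0 until N
    val r2 = r1.keys.distinct().zipWithIndex()
    // join on key gives (key, (id, row)); keep only (id, row)
    r2.join(r1).values
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-ids-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rows = sc.parallelize(Seq(
        OriginalRow("apple", "r1"),
        OriginalRow("pear", "r2"),
        OriginalRow("apple", "r3")))
      withIds(rows).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The join shuffles the full row set by key, which is likely why the approach is described as not so efficient.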

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to increment them like so:

    val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

If the number of distinct keys is relatively small, you might consider collecting them into a map and broadcasting them rather than using
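
The collect-and-broadcast idea mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: Spark 1.3 or later, a set of distinct keys small enough to fit in driver and executor memory, and a hypothetical `OriginalRow` type and `BroadcastIdsSketch` object that do not appear in the thread. Ids are shifted to start at 1, as in the snippet above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical row type; the real schema is not shown in the thread.
case class OriginalRow(key: String, payload: String)

object BroadcastIdsSketch {
  // Pairs every row with a 1-based id looked up from a broadcast map.
  def withIds(r0: RDD[OriginalRow]): RDD[(Long, OriginalRow)] = {
    // Build key -> id on the driver; assumes the distinct keys fit in memory.
    val idByKey: Map[String, Long] = r0.map(_.key)
      .distinct()
      .zipWithIndex()      // zero-based ids
      .mapValues(_ + 1)    // shift to 1..N
      .collect()
      .toMap

    // Ship the map to each executor once instead of joining two RDDs.
    val idsBc = r0.sparkContext.broadcast(idByKey)
    r0.map(row => (idsBc.value(row.key), row))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-ids-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rows = sc.parallelize(Seq(
        OriginalRow("apple", "r1"),
        OriginalRow("pear", "r2"),
        OriginalRow("apple", "r3")))
      withIds(rows).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The lookup runs as a plain map over the rows, so the full row set is not shuffled; the trade-off is that the whole key-to-id map must be held in memory on every executor.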