zipWithIndex is zero-based, so if you want 1 to N you'll need to increment the indices, like so:
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

If the number of distinct keys is relatively small, you might consider collecting them into a map and broadcasting it rather than using a join, like so:

val keyIndices = sc.broadcast(r2.collect.toMap)
// a broadcast variable must be dereferenced with .value on the workers
val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> A not-so-efficient way would be this:
>
> val r0: RDD[OriginalRow] = ...
> val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
> val r2 = r1.keys.distinct().zipWithIndex()
> val r3 = r2.join(r1).values
>
> On 11/18/14 8:54 PM, shahab wrote:
>
> Hi,
>
> In my Spark application, I am loading some rows from a database into
> Spark RDDs. Each row has several fields and a string key. Due to my
> requirements, I need to work with consecutive numeric ids (starting
> from 1 to N, where N is the number of unique keys) instead of string
> keys. Also, several rows can have the same string key.
>
> In Spark, how can I map each row into (Numeric_Key, OriginalRow) with
> map/reduce tasks such that rows with the same original string key get
> the same consecutive numeric key?
>
> Any hints?
>
> best,
> /Shahab

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io
W: www.velos.io
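To tie the two suggestions together, here is a minimal end-to-end sketch of the zipWithIndex-plus-broadcast approach. The Row case class, its fields, and the sample data below are hypothetical stand-ins for the actual schema, which the thread does not show:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical row type standing in for the real database rows.
case class Row(key: String, payload: String)

object ConsecutiveIds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("consecutive-ids").setMaster("local[*]"))

    // Sample input; several rows share the same string key.
    val rows: RDD[Row] = sc.parallelize(Seq(
      Row("a", "x1"), Row("b", "x2"), Row("a", "x3"), Row("c", "x4")))

    // Key each row by its string key.
    val keyed = rows.keyBy(_.key)

    // Assign each distinct key an index; zipWithIndex is zero-based,
    // so add 1 to get ids in 1..N.
    val keyIds = keyed.keys.distinct().zipWithIndex().mapValues(_ + 1)

    // Small key space: collect to a map and broadcast instead of joining.
    val idByKey = sc.broadcast(keyIds.collectAsMap())

    // Rows with the same string key receive the same numeric id.
    val result: RDD[(Long, Row)] =
      keyed.map { case (k, row) => (idByKey.value(k), row) }

    result.collect().foreach(println)
    sc.stop()
  }
}

One caveat: the ids are consecutive 1 to N, but zipWithIndex assigns them by partition and in-partition order, so the particular id a given key receives may not be stable across runs.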