Re: How to assign consecutive numeric id to each row based on its content?
Thanks a lot, both solutions work.

best,
/Shahab
How to assign consecutive numeric id to each row based on its content?
Hi,

In my Spark application, I am loading some rows from a database into Spark RDDs. Each row has several fields and a string key. Due to my requirements I need to work with consecutive numeric ids (from 1 to N, where N is the number of unique keys) instead of string keys. Several rows can also share the same string key.

In Spark, how can I map each row into (Numeric_Key, OriginalRow) as map/reduce tasks, such that rows with the same original string key get the same consecutive numeric key? Any hints?

best,
/Shahab
Re: How to assign consecutive numeric id to each row based on its content?
A not so efficient way can be this:

    val r0: RDD[OriginalRow] = ...
    // key each row by its string key
    val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
    // pair each distinct key with a consecutive index
    val r2 = r1.keys.distinct().zipWithIndex()
    // join the indices back onto the rows
    val r3 = r2.join(r1).values
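The same distinct-then-zipWithIndex-then-join idea can be tried out without a cluster, using plain Scala collections; the sample rows and names below are illustrative stand-ins, not from the thread:

```scala
// Plain-Scala sketch of the distinct + zipWithIndex + join approach.
// Each (stringKey, payload) pair stands in for a keyed OriginalRow.
object ConsecutiveIds {
  def main(args: Array[String]): Unit = {
    val rows = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)

    // Distinct keys, each paired with a consecutive index (zero-based,
    // as RDD.zipWithIndex is).
    val keyIndex: Map[String, Long] =
      rows.map(_._1).distinct.zipWithIndex
        .map { case (k, i) => (k, i.toLong) }.toMap

    // "Join" back: replace each string key with its numeric id; rows
    // sharing a string key get the same numeric key.
    val reKeyed = rows.map { case (k, v) => (keyIndex(k), v) }
    println(reKeyed) // List((0,1), (1,2), (0,3), (2,4))
  }
}
```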
Re: How to assign consecutive numeric id to each row based on its content?
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to increment the indices like so:

    val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

If the number of distinct keys is relatively small, you might consider collecting them into a map and broadcasting it rather than using a join, like so:

    val keyIndices = sc.broadcast(r2.collect.toMap)
    // look the id up through the broadcast variable's .value
    val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io  W: www.velos.io
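The 1-to-N shift plus the map-side lookup can also be checked with plain collections, a plain Map standing in for the broadcast variable; the row data and names below are hypothetical:

```scala
// Plain-Scala sketch of the 1-to-N variant: shift zipWithIndex's
// zero-based indices by one, then replace each key via a map lookup
// instead of a join.
object OneBasedIds {
  def main(args: Array[String]): Unit = {
    val rows = Seq("x" -> "r1", "y" -> "r2", "x" -> "r3")

    // Ids 1 to N for the distinct keys (here N = 2).
    val keyIndices: Map[String, Long] =
      rows.map(_._1).distinct.zipWithIndex
        .map { case (k, i) => (k, i.toLong + 1) }.toMap

    // Map-side lookup: every row keyed "x" gets id 1, "y" gets id 2.
    val r3 = rows.map { case (k, v) => (keyIndices(k), v) }
    println(r3) // List((1,r1), (2,r2), (1,r3))
  }
}
```

In Spark the map lookup avoids the shuffle a join would require, at the cost of fitting all distinct keys in driver and executor memory.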