best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Hi, I wanted to ask what's the best way to achieve per-key auto-increment numerals after sorting. E.g., raw file:
1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0
post-output (the last column is the position number after grouping on the first three fields and reverse sorting on the last
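A minimal sketch of the pattern being asked about, in plain Python (mirroring what a PySpark `groupByKey` + `mapValues` would do per key). It assumes, since the preview is cut off, that the sort is on the last two fields in reverse; the function name and exact sort key are illustrative, not from the thread.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("1", "a", "b", "c", 1, 1),
    ("1", "a", "b", "d", 0, 0),
    ("1", "a", "b", "e", 1, 0),
    ("2", "a", "e", "c", 0, 0),
    ("2", "a", "f", "d", 1, 0),
]

def number_per_key(rows):
    """Group on the first three fields, reverse-sort each group on the
    last two fields, then append a 1-based position number."""
    out = []
    keyfn = itemgetter(0, 1, 2)
    # groupby needs its input sorted by the grouping key
    for _, group in groupby(sorted(rows, key=keyfn), key=keyfn):
        ordered = sorted(group, key=itemgetter(4, 5), reverse=True)
        for pos, row in enumerate(ordered, start=1):
            out.append(row + (pos,))
    return out
```

In PySpark the same shape would be `rdd.groupByKey().flatMap(...)` with the sort-and-enumerate inside the `flatMap`, keeping each group's rows on one executor.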

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
Scheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

On Sun, Oct 18, 2015 at 11:17 PM, Jeff Zhang <zjf...@gmail.

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
? best, fahad

On Mon, Oct 19, 2015 at 10:45 AM, Davies Liu <dav...@databricks.com> wrote:
> What's the issue with groupByKey()?
>
> On Mon, Oct 19, 2015 at 1:11 AM, fahad shah <sfaha...@gmail.com> wrote:
>> Hi
>>
>> I wanted to ask whats the best way to ach

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
> On Sun, Oct 18, 2015 at 10:42 PM, fahad shah <sfaha...@gmail.com> wrote:
>> Hi
>>
>> I am trying to do pair RDDs, group by the key, and assign an id based on the key.
>> I am using PySpark with Spark 1.3; for some reason, I am getting this
>> error that I am unable to

pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-18 Thread fahad shah
Hi, I am trying to do pair RDDs: group by the key and assign an id based on the key. I am using PySpark with Spark 1.3, and for some reason I am getting this error that I am unable to figure out; any help is much appreciated. Things I tried (but to no effect): 1. make sure I am not doing any conversions on
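A plain-Python sketch of "assign an id based on key", the pattern the poster describes. It is an assumption about intent, not code from the thread: it builds a first-seen key-to-id map, the single-machine analogue of `rdd.keys().distinct().zipWithIndex()` joined back in PySpark (a route that also sidesteps `groupByKey`). The names `pairs` and `assign_key_ids` are illustrative.

```python
pairs = [("b", 2), ("a", 1), ("b", 3), ("c", 4)]

def assign_key_ids(pairs):
    """Return (key, key_id, value) triples, where key_id is a dense
    integer assigned in first-seen order of the keys."""
    key_ids = {}
    for key, _ in pairs:
        if key not in key_ids:
            key_ids[key] = len(key_ids)
    return [(key, key_ids[key], value) for key, value in pairs]
```

In a real job the equivalent PySpark join would look roughly like `rdd.join(rdd.keys().distinct().zipWithIndex())`, keeping the id assignment distributed rather than collecting keys to the driver.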