Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record?
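A minimal sketch of this UUID suggestion in Scala (lines is a hypothetical input DStream[String]; the key/value pairing is illustrative):

    import java.util.UUID
    import org.apache.spark.streaming.dstream.DStream

    // Attach a random, globally unique id to every record. No ordering,
    // but also no coordination needed across RDDs or batches.
    def withUuids(lines: DStream[String]): DStream[(String, String)] =
      lines.map(record => (UUID.randomUUID().toString, record))

Unlike a zipWithIndex-style counter, the ids carry no ordering; they are unique only with overwhelming probability.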

Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello, If I do the following, is the index guaranteed to be unique across all the RDDs here? dstream.transform { rdd => rdd.zipWithIndex.map { ... } } Thanks, -Soumitra.
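Spelled out, the snippet in question would look roughly like this (a sketch; lines is a hypothetical DStream[String] and the (index, record) pairing is illustrative):

    // zipWithIndex assigns each element a Long index within the RDD
    // it is called on, starting at 0.
    val indexed = lines.transform { rdd =>
      rdd.zipWithIndex.map { case (record, idx) => (idx, record) }
    }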

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So I guess zipWithUniqueId will be similar. Is there a way to get a unique index?
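For context, zipWithUniqueId gives the items of the k-th partition the ids k, k+n, k+2n, ..., where n is the number of partitions. So, as the reply above notes for zipWithIndex, the ids are unique only within a single RDD; every batch's RDD reuses them. A sketch:

    // Ids are unique within one RDD only; every batch repeats them.
    val perRddIds = lines.transform(rdd => rdd.zipWithUniqueId())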

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
You can use the RDD id as the seed, which is unique within the same SparkContext. Suppose none of the RDDs contains more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid). Just a hack.
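Written out inside a transform, the hack might look like this (a sketch, assuming as above that no RDD in the stream ever holds more than 1e9 records):

    // rdd.id is unique per SparkContext, so offsetting the within-RDD
    // unique id by rdd.id * 1e9 yields ids unique across batches.
    val globalIds = lines.transform { rdd =>
      val offset = rdd.id.toLong * 1000000000L  // 1e9
      rdd.zipWithUniqueId().mapValues(uid => offset + uid)
    }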

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Thanks. Just to double-check: rdd.id would be unique for each batch in a DStream?

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD.

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
I see an issue here. If rdd.id is 1000, then rdd.id * 1e9.toLong would be BIG. I wish there were a DStream mapPartitionsWithIndex.
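Two footnotes on this. First, the value stays representable: 1000 * 1e9 = 1e12, comfortably below Long.MaxValue (about 9.2e18). Second, while DStream has no mapPartitionsWithIndex of its own, the same effect is available through transform; a sketch, with a purely illustrative (partition, counter) id scheme:

    // Emulate mapPartitionsWithIndex on a DStream via transform:
    // pair each record with (partition index, position within partition).
    val partitionIndexed = lines.transform { rdd =>
      rdd.mapPartitionsWithIndex { (part, iter) =>
        iter.zipWithIndex.map { case (record, i) => ((part, i), record) }
      }
    }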