Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Tathagata Das
> If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record?

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Soumitra Kumar
> If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record?

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Tathagata Das
If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record?
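A minimal sketch of that suggestion (withUuid and the stream parameter are illustrative names, not from the thread):

    import java.util.UUID
    import scala.reflect.ClassTag
    import org.apache.spark.streaming.dstream.DStream

    // Tag every record with a freshly generated random UUID. Uniqueness is
    // probabilistic (128-bit ids), and there are no ordering guarantees.
    def withUuid[T: ClassTag](stream: DStream[T]): DStream[(String, T)] =
      stream.map(record => (UUID.randomUUID().toString, record))

Unlike the rdd.id arithmetic discussed below, this needs no assumption about batch sizes; the trade-off is a 128-bit random id instead of a compact Long.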

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Soumitra Kumar
I see an issue here. If rdd.id is 1000, then rdd.id * 1e9.toLong would be BIG. I wish there were a DStream.mapPartitionsWithIndex.
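For scale: even rdd.id = 1000 only gives an offset of 10^12, far below Long.MaxValue (about 9.2 * 10^18), so the ids stay representable. As for the second wish, DStream itself has no mapPartitionsWithIndex, but transform exposes each batch's RDD, which does. A sketch (withPartitionIndex and the element type are illustrative):

    import org.apache.spark.streaming.dstream.DStream

    // Emulate mapPartitionsWithIndex on a DStream via transform. Note the
    // partition index is only unique within a single batch's RDD, not
    // across batches.
    def withPartitionIndex(stream: DStream[String]): DStream[(Int, String)] =
      stream.transform { rdd =>
        rdd.mapPartitionsWithIndex { (partitionIndex, records) =>
          records.map(record => (partitionIndex, record))
        }
      }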

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Patrick Wendell
Yeah - each batch will produce a new RDD.
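One way to see this is to print the id of each batch's RDD (a sketch; stream is any DStream already set up in your app):

    // Each micro-batch materializes a fresh RDD with its own id, so this
    // prints a different id every batch interval.
    stream.foreachRDD { rdd =>
      println(s"batch RDD id = ${rdd.id}, records = ${rdd.count()}")
    }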

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Soumitra Kumar
Thanks. Just to double check, rdd.id would be unique for a batch in a DStream?

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Xiangrui Meng
You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use

    rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

Just a hack ..
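Inside a DStream, this hack would look roughly like the following sketch. It assumes, as above, that no batch RDD produces unique ids of 1 billion or more; withGlobalId and the element type are illustrative:

    import org.apache.spark.streaming.dstream.DStream

    // Offset each record's per-RDD unique id by rdd.id * 1e9 so that ids
    // from different batch RDDs cannot collide (under the size assumption).
    def withGlobalId(stream: DStream[String]): DStream[(String, Long)] =
      stream.transform { rdd =>
        val offset = rdd.id * 1e9.toLong // rdd.id is an Int, offset a Long
        rdd.zipWithUniqueId().map { case (record, uid) => (record, offset + uid) }
      }

One caveat: zipWithUniqueId assigns ids of the form k, n+k, 2n+k, ... in the kth of n partitions, so with skewed partitions the largest id can exceed the record count; leave headroom under the 1e9 bound.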

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Soumitra Kumar
So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index?

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui
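Concretely, zipWithIndex numbers each RDD independently (a sketch with illustrative data; sc is a SparkContext):

    // Indices restart at 0 in each RDD - and each DStream batch is a new RDD.
    val a = sc.parallelize(Seq("x", "y", "z")).zipWithIndex() // (x,0) (y,1) (z,2)
    val b = sc.parallelize(Seq("p", "q")).zipWithIndex()      // (p,0) (q,1)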

Spark Streaming: DStream - zipWithIndex

2014-08-27 Soumitra Kumar
Hello,

If I do:

    dstream.transform { rdd =>
      rdd.zipWithIndex.map { ... }
    }

is the index guaranteed to be unique across all RDDs here?

Thanks,
-Soumitra.