Yes, that is an option. I started with a function of the batch time and the record index to generate the id as a long. This may be faster than generating a UUID, with the added benefit that the ids sort by time.
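Roughly what I have in mind (just a sketch with made-up names, not the exact code I ran; it assumes a batch never holds more than 2^20 records, otherwise the index would bleed into the time bits):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val conf = new SparkConf().setAppName("ids").setMaster("local[2]")  // master set only for a local test
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)  // any source would do

    // Pack the batch time into the high bits and the zipWithIndex index into the
    // low 20 bits of a Long, so ids are unique per batch and sort by batch time.
    val withIds = lines.transform { (rdd: RDD[String], batchTime: Time) =>
      rdd.zipWithIndex.map { case (record, idx) =>
        ((batchTime.milliseconds << 20) | idx, record)
      }
    }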
----- Original Message -----
From: "Tathagata Das" <tathagata.das1...@gmail.com>
To: "Soumitra Kumar" <kumar.soumi...@gmail.com>
Cc: "Xiangrui Meng" <men...@gmail.com>, user@spark.apache.org
Sent: Thursday, August 28, 2014 2:19:38 AM
Subject: Re: Spark Streaming: DStream - zipWithIndex

If you just want an arbitrary unique id attached to each record in a dstream (no ordering etc.), then why not generate and attach a UUID to each record?

On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:

I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG.

I wish there was a DStream mapPartitionsWithIndex.

On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng <men...@gmail.com> wrote:

You can use the RDD id as the seed, which is unique within the same Spark context. Suppose none of the RDDs contains more than 1 billion records. Then you can use

rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

Just a hack ..

On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
> So, I guess zipWithUniqueId will be similar.
>
> Is there a way to get a unique index?
>
> On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> No. The indices start at 0 for every RDD. -Xiangrui
>>
>> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
>> <kumar.soumi...@gmail.com> wrote:
>> > Hello,
>> >
>> > If I do:
>> >
>> > DStream transform {
>> >   rdd.zipWithIndex.map {
>> >
>> > Is the index guaranteed to be unique across all RDDs here?
>> >
>> >   }
>> > }
>> >
>> > Thanks,
>> > -Soumitra.