Yes, that is an option.

I started with a function of batch time and index to generate the id as a Long. This
may be faster than generating a UUID, with the added benefit of sorting by time.
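
A minimal sketch of that idea (hedged: the socket source, host/port, and
names like "withIds" are placeholders, and it assumes fewer than 1,000,000
records per batch so the index fits below the time component):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val ssc = new StreamingContext(new SparkConf().setAppName("ids"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)  // any input stream

    // Sortable Long id: high digits from the batch time (ms since epoch),
    // low digits from the record's index within the batch. The product
    // time * 1e6 stays well below Long.MaxValue (~9.2e18).
    val withIds = lines.transform { (rdd: RDD[String], time: Time) =>
      rdd.zipWithIndex().map { case (line, i) =>
        (time.milliseconds * 1000000L + i, line)
      }
    }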

----- Original Message -----
From: "Tathagata Das" <tathagata.das1...@gmail.com>
To: "Soumitra Kumar" <kumar.soumi...@gmail.com>
Cc: "Xiangrui Meng" <men...@gmail.com>, user@spark.apache.org
Sent: Thursday, August 28, 2014 2:19:38 AM
Subject: Re: Spark Streaming: DStream - zipWithIndex


If you just want an arbitrary unique id attached to each record in a DStream (no
ordering etc.), then why not generate and attach a UUID to each record?
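
For example, a minimal sketch (reusing an illustrative DStream[String]
named "lines"):

    import java.util.UUID

    // Attach a random UUID to every record: unique, but carries no ordering.
    val withUuids = lines.map(record => (UUID.randomUUID().toString, record))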

On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:

I see an issue here.


If rdd.id is 1000, then rdd.id * 1e9.toLong would already be 10^12, which is BIG.


I wish there were a DStream mapPartitionsWithIndex.
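
A possible workaround (a sketch, not a built-in API): reach the RDD-level
mapPartitionsWithIndex through transform. Note the partition index restarts
at 0 in every batch, so it is only unique within one RDD:

    // "lines" is an illustrative DStream[String]; tag each record with
    // (partition index, position within the partition).
    val perPartition = lines.transform { rdd =>
      rdd.mapPartitionsWithIndex { (partIndex, iter) =>
        iter.zipWithIndex.map { case (record, i) => ((partIndex, i), record) }
      }
    }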

On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng <men...@gmail.com> wrote:


You can use the RDD id as the seed, which is unique within the same Spark
context. Suppose none of the RDDs contains more than 1 billion
records. Then you can use

rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) 

Just a hack .. 
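
Applied to a DStream, the same hack might look like this (a sketch; "lines"
is an illustrative DStream[String], and the import provides mapValues on
pair RDDs in Spark 1.x):

    import org.apache.spark.SparkContext._  // pair-RDD operations (Spark 1.x)

    val uniqueIds = lines.transform { rdd =>
      // rdd.id is unique within a SparkContext, so this offset keeps ids
      // distinct across batches while each RDD has fewer than 1e9 records.
      val offset = rdd.id * 1e9.toLong
      rdd.zipWithUniqueId().mapValues(uid => offset + uid)
    }

For scale: Long.MaxValue is about 9.2 * 10^18, so rdd.id * 1e9.toLong stays
in range until rdd.id exceeds roughly 9 * 10^9.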

On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
> So, I guess zipWithUniqueId will be similar. 
> 
> Is there a way to get a unique index?
> 
> 
> On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> 
>> No. The indices start at 0 for every RDD. -Xiangrui 
>> 
>> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
>> > Hello, 
>> > 
>> > If I do:
>> >
>> >     dstream.transform { rdd =>
>> >       rdd.zipWithIndex.map { ... }
>> >     }
>> >
>> > is the index guaranteed to be unique across all RDDs here?
>> > 
>> > Thanks, 
>> > -Soumitra. 
> 
> 