Re: Spark Streaming: DStream - zipWithIndex

Tathagata Das Thu, 28 Aug 2014 12:51:40 -0700

But then if you want to generate ids that are unique across ALL the records
that you are going to see in a stream (which can be potentially infinite),
then you definitely need a number space larger than long :)
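
To put a rough size on that number space, here is a minimal sketch of an
id built from batch time and index and packed into one long. The 20-bit
index budget, the object name BatchIds and its methods are assumptions
made for illustration; none of them come from Spark.

```scala
object BatchIds {
  // Assumption: at most 2^20 (about one million) records per batch.
  val IndexBits = 20
  val MaxIndex: Long = (1L << IndexBits) - 1
  // 63 - 20 = 43 bits remain for the batch time in milliseconds,
  // which keeps every id non-negative.
  val MaxTimeMs: Long = (1L << (63 - IndexBits)) - 1

  // Batch time goes in the high bits, so ids sort by time first
  // and by index within a batch second.
  def makeId(batchTimeMs: Long, index: Long): Long = {
    require(batchTimeMs >= 0 && batchTimeMs <= MaxTimeMs,
      s"batch time out of range: $batchTimeMs")
    require(index >= 0 && index <= MaxIndex,
      s"index out of range: $index")
    (batchTimeMs << IndexBits) | index
  }

  // Recover the two components from an id.
  def batchTimeOf(id: Long): Long = id >>> IndexBits
  def indexOf(id: Long): Long = id & MaxIndex
}
```

With this split the time field covers roughly 278 years of epoch
milliseconds, so the scheme is finite in both directions: it fails once a
batch holds more than 2^20 records or the time field runs out. Inside a
DStream transform, the index would presumably come from rdd.zipWithIndex
and the time from the batch time passed to the transform function.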


TD


On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar <kumar.soumi...@gmail.com>
wrote:

> Yes, that is an option.
>
> I started with a function of batch time and index to generate the id as a
> long. This may be faster than generating a UUID, with the added benefit of
> sorting based on time.
>
> ----- Original Message -----
> From: "Tathagata Das" <tathagata.das1...@gmail.com>
> To: "Soumitra Kumar" <kumar.soumi...@gmail.com>
> Cc: "Xiangrui Meng" <men...@gmail.com>, user@spark.apache.org
> Sent: Thursday, August 28, 2014 2:19:38 AM
> Subject: Re: Spark Streaming: DStream - zipWithIndex
>
>
> If you just want an arbitrary unique id attached to each record in a
> dstream (no ordering etc), then why not generate and attach a UUID to each
> record?
>
>
>
>
>
> On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar < kumar.soumi...@gmail.com > wrote:
>
>
>
> I see an issue here.
>
>
> If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG.
>
>
> I wish there was DStream mapPartitionsWithIndex.
>
>
>
>
>
> On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng < men...@gmail.com > wrote:
>
>
> You can use RDD id as the seed, which is unique in the same spark
> context. Suppose none of the RDDs would contain more than 1 billion
> records. Then you can use
>
> rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)
>
> Just a hack ..
>
> On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar < kumar.soumi...@gmail.com > wrote:
> > So, I guess zipWithUniqueId will be similar.
> >
> > Is there a way to get unique index?
> >
> >
> > On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng < men...@gmail.com > wrote:
> >>
> >> No. The indices start at 0 for every RDD. -Xiangrui
> >>
> >> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
> >> < kumar.soumi...@gmail.com > wrote:
> >> > Hello,
> >> >
> >> > If I do:
> >> >
> >> > DStream transform {
> >> > rdd.zipWithIndex.map {
> >> >
> >> > Is the index guaranteed to be unique across all RDDs here?
> >> >
> >> > }
> >> > }
> >> >
> >> > Thanks,
> >> > -Soumitra.
> >
> >
>
>
>
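
For reference, the arithmetic behind the RDD-id-as-seed hack quoted above
can be checked without a cluster. This is only a sketch: SeededIds,
uniqueIdInRdd and globalId are hypothetical names, and uniqueIdInRdd just
mirrors the numbering documented for zipWithUniqueId (items in the kth of
n partitions get k, n+k, 2*n+k, ...).

```scala
object SeededIds {
  // 1e9, the per-RDD stride used in the hack.
  val Stride: Long = 1000000000L

  // Item i of partition k, with n partitions, gets i * n + k.
  def uniqueIdInRdd(partition: Int, numPartitions: Int, itemIndex: Long): Long =
    itemIndex * numPartitions + partition

  // rddId is an Int, so rddId * Stride is computed as a Long and
  // stays below Long.MaxValue for every possible RDD id.
  def globalId(rddId: Int, uid: Long): Long = {
    require(uid >= 0 && uid < Stride, s"uid does not fit in the stride: $uid")
    rddId * Stride + uid
  }
}
```

With rdd.id = 1000 the seed is 10^12, far below the long limit of about
9.2 * 10^18. The tighter constraint is the uid itself: since
zipWithUniqueId strides by the number of partitions, the largest uid is
about (largest partition size) * (number of partitions), which can pass
1e9 before the record count does when partitions are unbalanced. RDD ids
are also only unique within one SparkContext, so ids would repeat after a
restart.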
