If you just want an arbitrary unique id attached to each record in a DStream (no
ordering etc.), then why not generate and attach a UUID to each record?
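A minimal sketch of that suggestion, in plain Python for illustration (the thread's code is Scala, where you would use java.util.UUID.randomUUID; the helper name and sample batch below are made up):

```python
import uuid

def attach_uuid(records):
    # Pair each record with a random (type 4) UUID. Ordering carries
    # no meaning; the id is just globally unique with overwhelming
    # probability, with no coordination across batches needed.
    return [(str(uuid.uuid4()), rec) for rec in records]

batch = ["a", "b", "c"]
tagged = attach_uuid(batch)
ids = [i for i, _ in tagged]
assert len(set(ids)) == len(ids)  # all ids distinct
```

The trade-off versus an index-based scheme: the ids are 128-bit strings rather than compact Longs, and they carry no ordering information at all.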
On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:
> I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG.
Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org
Sent: Thursday, August 28, 2014 2:19:38 AM
Subject: Re: Spark Streaming: DStream - zipWithIndex
Hello,

If I do:

DStream transform {
  rdd.zipWithIndex.map {
    ...
  }
}

Is the index guaranteed to be unique across all RDDs here?

Thanks,
-Soumitra.
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:
> Is the index guaranteed to be unique across all RDDs here?
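To see why, here is a plain-Python imitation of the zipWithIndex semantics (the helper and sample batches are hypothetical; a real RDD also spreads the numbering across partitions, but the restart-at-zero behaviour is the same):

```python
def zip_with_index(records):
    # Mimics RDD.zipWithIndex: indices start at 0 within each RDD.
    return [(rec, i) for i, rec in enumerate(records)]

batch1 = zip_with_index(["a", "b", "c"])
batch2 = zip_with_index(["d", "e"])
print(batch1)  # [('a', 0), ('b', 1), ('c', 2)]
print(batch2)  # [('d', 0), ('e', 1)]  -- indices repeat across batches
```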
So, I guess zipWithUniqueId will be similar.
Is there a way to get a unique index?
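It is indeed similar: zipWithUniqueId gives the k-th item of partition p the id k * numPartitions + p, so ids are unique within one RDD but the same scheme replays for every RDD in the stream. A plain-Python imitation (helper name and sample partitions invented here):

```python
def zip_with_unique_id(partitions):
    # Mimics RDD.zipWithUniqueId: the k-th item of partition p
    # gets id k * numPartitions + p (unique within this one RDD).
    n = len(partitions)
    out = []
    for p, part in enumerate(partitions):
        for k, rec in enumerate(part):
            out.append((rec, k * n + p))
    return out

rdd1 = zip_with_unique_id([["a", "b"], ["c"]])   # ids 0, 2, 1
rdd2 = zip_with_unique_id([["d"], ["e"]])        # ids 0, 1
ids1 = {i for _, i in rdd1}
ids2 = {i for _, i in rdd2}
assert ids1 & ids2  # ids collide across RDDs, just like zipWithIndex
```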
On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
> No. The indices start at 0 for every RDD. -Xiangrui
You can use RDD id as the seed, which is unique in the same spark
context. Suppose none of the RDDs would contain more than 1 billion
records. Then you can use:

rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)
Just a hack ..
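A plain-Python sanity check of the scheme (the thread's one-liner is Scala; the cap constant and helper name below are assumptions for illustration):

```python
RECORDS_PER_RDD = 10**9  # assumed cap: at most 1 billion records per RDD

def composite_ids(rdd_id, uids):
    # Offset each within-RDD unique id by the RDD id, as in
    # rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid).
    # Distinct rdd_ids land in disjoint ranges, so ids never collide
    # across batches as long as uid < RECORDS_PER_RDD.
    return [rdd_id * RECORDS_PER_RDD + uid for uid in uids]

a = composite_ids(41, range(5))
b = composite_ids(42, range(5))
assert not set(a) & set(b)  # disjoint ranges, no collisions
```

Note that in real Spark code you would capture rdd.id in a local val before the mapValues closure, so the RDD itself is not dragged into the serialized task.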
On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:
> Is there a way to get a unique index?

Thanks.
Just to double check, rdd.id would be unique for a batch in a DStream?
On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:
> You can use RDD id as the seed, which is unique in the same spark context.
Yeah - each batch will produce a new RDD.
On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:
> Just to double check, rdd.id would be unique for a batch in a DStream?
I see an issue here.
If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG.
I wish there were a DStream.mapPartitionsWithIndex.
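For what it's worth, big is not yet overflowing: a JVM Long tops out at 2^63 - 1, roughly 9.2e18, so rdd.id * 1e9.toLong stays safe until rdd.id approaches about 9.2 billion. A quick check in plain Python (Python ints are unbounded, so the Long limit is modelled explicitly here):

```python
LONG_MAX = 2**63 - 1  # JVM Long.MaxValue

rdd_id = 1000
max_uid = 10**9 - 1                     # largest uid under the 1e9 cap
biggest = rdd_id * 10**9 + max_uid      # largest composite id for this RDD
assert biggest < LONG_MAX               # still fits in a Long
assert LONG_MAX // 10**9 == 9223372036  # rdd ids safe up to ~9.2 billion
```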
On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:
> You can use RDD id as the seed, which is unique in the same spark context.