Yes, that is true. A UUID only introduces uniqueness to the record. Some NoSQL
databases require a primary key, and a UUID can be used for that.
import java.util.UUID

scala> var pk = UUID.randomUUID
pk: java.util.UUID = 0d91e11a-f5f6-4b4b-a120-8c46a31dad0b

scala> pk = UUID.randomUUID
pk: java.util.UUID =
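If the same thing is needed as a DataFrame column in PySpark, a minimal sketch using Spark SQL's built-in uuid() function (the column name "uid" is just illustrative):

from pyspark.sql import functions as F

df = spark.range(5)
# uuid() generates a random UUID string per row; unique, but not ordered
df = df.withColumn("uid", F.expr("uuid()"))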
Thank you so much for the insights.
@Mich Talebzadeh Really appreciate your detailed examples.
@Jungtaek Lim I see your point. I am thinking of having a mapping table from
UUID to an incremental ID and leveraging range pruning etc. on a large
dataset (a rough sketch of that follows below).
@sebastian I have to check how to do something like that.
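A rough sketch of that mapping-table idea (the table and column names uuid_map, uuid and seq_id are hypothetical, and the paths are illustrative; seq_id would be assigned once by a separate batch job):

from pyspark.sql import functions as F

# uuid -> seq_id lookup table
uuid_map = spark.read.parquet("s3a://my-bucket/uuid_map/")
data = spark.read.parquet("s3a://my-bucket/data/")

# join on the UUID once, then range-prune on the integer surrogate key
enriched = data.join(uuid_map, on="uuid", how="inner")
subset = enriched.filter((F.col("seq_id") >= 1000) & (F.col("seq_id") < 2000))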
Theoretically, the composed value of batchId +
monotonically_increasing_id() would achieve the goal. The major downside is
that you'll need to deal with "deduplication" of the output based on batchId,
as monotonically_increasing_id() is non-deterministic. You need to ensure
there's NO overlap in the output.
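A rough sketch of that composition inside foreachBatch (the function name, column names and output path are illustrative; the dedup caveat above still applies because monotonically_increasing_id() can differ between retries of the same batch):

from pyspark.sql import functions as F

def write_batch(batch_df, batch_id):
    # compose a record id from the deterministic batch id and a
    # per-batch monotonically increasing id
    with_id = (batch_df
               .withColumn("batch_id", F.lit(batch_id))
               .withColumn("record_id",
                           F.concat_ws("-",
                                       F.lit(str(batch_id)),
                                       F.monotonically_increasing_id().cast("string"))))
    with_id.write.mode("append").parquet("s3a://my-bucket/output/")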
Sorry, a correction regarding creating an incrementing ID in PySpark:
>>> df = spark.range(1,5)
>>> from pyspark.sql.window import Window as W
>>> from pyspark.sql import functions as F
>>> df = df.withColumn("idx", F.monotonically_increasing_id())
>>> Wspec = W.orderBy("idx")
>>> df.withColumn("idx",
If you want them to survive across jobs you can use snowflake IDs or
similar ideas, depending on your use case.
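A minimal snowflake-style generator sketch in plain Python, assuming the usual layout of a 41-bit millisecond timestamp, 10-bit worker id and 12-bit per-millisecond sequence (the custom epoch and class name are illustrative, not any particular library's API):

import time

CUSTOM_EPOCH_MS = 1609459200000  # 2021-01-01, an arbitrary custom epoch

class SnowflakeId:
    def __init__(self, worker_id):
        self.worker_id = worker_id & 0x3FF  # 10 bits
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            # same millisecond: bump the 12-bit sequence, spin if exhausted
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - CUSTOM_EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

IDs generated this way are roughly time-ordered and survive across jobs, as long as each generator instance gets a distinct worker_id.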
On Tue, 13 Jul 2021, 9:33 pm Mich Talebzadeh wrote:
Meaning a monotonically incrementing ID, as in an Oracle sequence, for each
record read from Kafka, adding that to your dataframe?
If you do Structured Streaming in micro-batch mode, you will get
what is known as a batchId:
result = streamingDataFrame.select( \
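A minimal sketch of where that batch id surfaces, assuming a hypothetical sendToSink function wired in via foreachBatch (paths and column names are illustrative):

from pyspark.sql import functions as F

def sendToSink(batch_df, batch_id):
    # foreachBatch hands each micro-batch DataFrame plus its batch id here
    out = batch_df.withColumn("batch_id", F.lit(batch_id))
    out.write.mode("append").parquet("s3a://my-bucket/output/")

query = (streamingDataFrame.writeStream
         .foreachBatch(sendToSink)
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
         .start())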
Hello,
I am using Spark Structured Streaming to sink data from Kafka to AWS S3. I
am wondering if it's possible for me to introduce a uniquely incrementing
identifier for each record, as we do in an RDBMS (an incrementing long id)?
This would greatly help with range pruning when reading based on this ID.