Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-15 Thread Mich Talebzadeh
Yes, that is true. UUID only introduces uniqueness to the record. Some NoSQL databases require a primary key, and a UUID can serve as one.

import java.util.UUID

scala> var pk = UUID.randomUUID
pk: java.util.UUID = 0d91e11a-f5f6-4b4b-a120-8c46a31dad0b

scala> pk = UUID.randomUUID
pk: java.util.UUID =
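For reference, a minimal PySpark sketch of the same idea (the dataframe name is an assumption): Spark SQL's built-in uuid() function attaches a unique primary-key column, but the values are random rather than incrementing, so they cannot help with range pruning on their own.

from pyspark.sql import functions as F

# uuid() is a built-in Spark SQL function; the column is unique per row
# but random, i.e. not usable as an incrementing, range-prunable key
df_with_pk = streamingDataFrame.withColumn("pk", F.expr("uuid()"))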

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-15 Thread Felix Kizhakkel Jose
Thank you so much for the insights. @Mich Talebzadeh, I really appreciate your detailed examples. @Jungtaek Lim, I see your point. I am thinking of having a mapping table from UUID to incremental ID and leveraging range pruning etc. on a large dataset. @Sebastian, I have to check how to do something like
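A minimal sketch of such a mapping table, assuming the UUID-keyed records have already landed in S3 (paths and column names are hypothetical). Note the window has no partitionBy, so Spark shuffles every distinct key to a single partition; that only works for modest key counts.

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

records = spark.read.parquet("s3://bucket/events")   # assumed path
mapping = (records.select("pk").distinct()
           .withColumn("seq_id", F.row_number().over(W.orderBy("pk"))))
# keep the UUID -> incremental ID mapping alongside the data for range pruning
mapping.write.mode("overwrite").parquet("s3://bucket/pk_mapping")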

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Jungtaek Lim
Theoretically, the composed value of batchId + monotonically_increasing_id() would achieve the goal. The major downside is that you'll need to deal with "deduplication" of the output based on batchId, as monotonically_increasing_id() is nondeterministic. You need to ensure there's NO overlap on output
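A minimal sketch of that composition inside foreachBatch (the sink path, column names, and the dynamic-partition-overwrite dedup strategy are all assumptions, not a mandated approach):

from pyspark.sql import functions as F

# allow a replayed batch to overwrite only its own partition
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

def write_batch(batch_df, batch_id):
    out = (batch_df
           .withColumn("batch_id", F.lit(batch_id))
           .withColumn("row_in_batch", F.monotonically_increasing_id()))
    # (batch_id, row_in_batch) is unique within a run, but row_in_batch is
    # nondeterministic on retry, so partition by batch_id and overwrite the
    # whole batch whenever it is replayed
    out.write.mode("overwrite").partitionBy("batch_id").parquet("s3://bucket/out")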

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Mich Talebzadeh
Sorry, a correction regarding creating an incrementing ID in PySpark:

>>> df = spark.range(1,5)
>>> from pyspark.sql.window import Window as W
>>> from pyspark.sql import functions as F
>>> df = df.withColumn("idx", F.monotonically_increasing_id())
>>> Wspec = W.orderBy("idx")
>>> df.withColumn("idx",
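Presumably the truncated last step re-numbers the rows with row_number() over the window. A completed sketch of the whole pattern (with the usual caveat that an un-partitioned window pulls all rows onto one executor):

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = spark.range(1, 5)
# monotonically_increasing_id() is unique but has gaps; use it only to fix
# an ordering, then re-number with row_number() to get consecutive 1,2,3,...
df = df.withColumn("idx", F.monotonically_increasing_id())
Wspec = W.orderBy("idx")
df = df.withColumn("idx", F.row_number().over(Wspec))
df.show()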

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Sebastian Piu
If you want them to survive across jobs you can use snowflake IDs or similar ideas, depending on your use case.

On Tue, 13 Jul 2021, 9:33 pm Mich Talebzadeh, wrote:
> Meaning a monotonically incrementing ID, as in an Oracle sequence, for each
> record read from Kafka, adding that to your
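For anyone unfamiliar with the idea, a minimal single-process sketch of a Twitter-Snowflake-style generator (the class itself is hypothetical; bit widths follow the original design): 41 bits of millisecond timestamp, 10 bits of worker id, 12 bits of per-millisecond sequence. IDs are unique across workers and roughly time-ordered, which is what lets them survive across jobs.

import threading
import time

class SnowflakeId:
    # millisecond timestamp | 10-bit worker id | 12-bit sequence
    EPOCH = 1288834974657  # Twitter's custom epoch (ms)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:           # sequence exhausted this ms
                    while now <= self.last_ms:   # spin to the next millisecond
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence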

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Mich Talebzadeh
Meaning a monotonically incrementing ID, as in an Oracle sequence, for each record read from Kafka, adding that to your dataframe? If you do Structured Streaming in micro-batch mode, you will get what is known as a batchId.

result = streamingDataFrame.select( \
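A minimal sketch of how that batchId surfaces in micro-batch mode via foreachBatch (the sink path and checkpoint location are assumptions):

from pyspark.sql import functions as F

def write_batch(batch_df, batch_id):
    # batch_id increments by one per micro-batch and is stable across retries
    batch_df.withColumn("batch_id", F.lit(batch_id)) \
            .write.mode("append").parquet("s3://bucket/out")

query = (streamingDataFrame.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://bucket/checkpoints")
         .start())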

How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Felix Kizhakkel Jose
Hello, I am using Spark Structured Streaming to sink data from Kafka to AWS S3. I am wondering if it's possible for me to introduce a uniquely incrementing identifier for each record, as we do in an RDBMS (an incrementing long id)? This would greatly benefit range pruning when reading based on this ID.