the-other-tim-brown commented on PR #7640:
URL: https://github.com/apache/hudi/pull/7640#issuecomment-1382053922

   > Hi @the-other-tim-brown I'm interested in this functionality and have some
   > questions, if I understand correctly the UUID will be the same for the same set
   > of values in columns that it's based on?
   >
   > So this generator can't be used for generating a surrogate key (a standard
   > practice in data warehousing) as key is derived from data? My understanding of
   > keyless model is that record key is a surrogate key that's globally unique.
   >
   > I'm wondering if there's something that does not allow to create globally
   > unique ids via the key generator interface (maybe virtual keys support)? At the
   > same time in context of this PR, what's the place of
   > [UuidKeyGenerator](https://github.com/apache/hudi/blob/41a9986a7641f3232b1edd2a737fd4b7aa430dbf/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/UuidKeyGenerator.scala)?
   > Could it be used to generate surrogate keys that are globally unique?
   
   Yes, that is correct: the keys are not guaranteed to be unique here. The
issue we had with random UUIDs is that we use DeltaStreamer, and whenever the
DAG was retriggered the data was regenerated with new random UUIDs. The same
record could then be written to a different file group, which caused
duplicate or lost data because of some internals of how Hudi works.
@nsivabalan had some similar thoughts around other approaches, can you chime in
here?
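
   To make the trade-off concrete, here is a minimal sketch. It is not the code in this PR: `KeySketch`, `deterministicKey`, and `randomKey` are hypothetical names, and it assumes the key is a name-based UUID built from the string values of the chosen columns. It only illustrates why a data-derived key survives a recomputation of the same input while a random key does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative only: hypothetical helper, not Hudi's key generator API.
final class KeySketch {

    // Derive a name-based (type 3, MD5) UUID from the given column values.
    // The same values always produce the same key, so a retried task
    // regenerates identical keys for identical rows.
    static String deterministicKey(String... columnValues) {
        StringBuilder sb = new StringBuilder();
        for (String v : columnValues) {
            // Length-prefix each value so ("ab", "c") and ("a", "bc") differ;
            // -1 marks a null so it is distinct from an empty string.
            sb.append(v == null ? -1 : v.length()).append(':');
            if (v != null) {
                sb.append(v);
            }
        }
        byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    // A random (type 4) UUID: globally unique in practice, but a different
    // value every time the same row is recomputed.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Simulate computing the key for the same row twice (e.g. a retry).
        String first = deterministicKey("2023-01-13", "user-42", "click");
        String retry = deterministicKey("2023-01-13", "user-42", "click");
        System.out.println("deterministic keys match: " + first.equals(retry));

        String firstRandom = randomKey();
        String retryRandom = randomKey();
        System.out.println("random keys match: " + firstRandom.equals(retryRandom));
    }
}
```

   The flip side is the one raised in the question: two rows with identical values in the chosen columns get the same key, so this is not a globally unique surrogate key.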


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
