samserpoosh commented on issue #9143:
URL: https://github.com/apache/hudi/issues/9143#issuecomment-1639102139

   FWIW, I'm seeing an identical issue on my end. The `before` field is **not** 
populated correctly; all of its fields hold **default** values instead. As 
Sydney pointed out, this produces a **wrong** partition key, so DeltaStreamer 
cannot find the right partition and, ultimately, the right record to 
**delete**.
   
   > or if there could be a workaround in Deltastreamer that allows it to 
delete the record without knowing what partition it is from.
   
   @sydneyhoran IIUC, when dealing with **partitioned datasets/Hudi Tables**, 
uniqueness is at a **partition level** as opposed to being global. Per Hudi 
[documentation](https://hudi.apache.org/docs/key_generation/):
   
   > In general, Hudi supports both partitioned and global indexes. For a 
dataset with partitioned index(which is most commonly used), each record is 
uniquely identified by a pair of record key and partition path. But for a 
dataset with global index, each record is uniquely identified by just the 
record key. There won't be any duplicate record keys across partitions.
   
   So I **think** that, since we're using partitioned datasets, global 
uniqueness does not exist, and DeltaStreamer zeroes in on the partition and 
then the record, as opposed to doing a global lookup for the record. That's my 
understanding, but we'll see whether the Hudi team disagrees with this 
interpretation.
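   
   To illustrate what I mean, here is a toy sketch of the two lookup styles. 
It uses plain Python dicts and is **not** Hudi's code or API; the function 
names, the partition value, and the record key are all made up for 
illustration. The only assumption is that a bad `before` ends up as a wrong 
(here, empty) partition path.

```python
# Toy model only: plain dicts stand in for the two index styles.
# None of these names are Hudi APIs; all values are hypothetical.

def delete_partitioned(index, partition_path, record_key):
    """Partitioned index: a record is identified by (partition_path, record_key)."""
    return index.pop((partition_path, record_key), None) is not None


def delete_global(index, record_key):
    """Global index: a record is identified by record_key alone."""
    return index.pop(record_key, None) is not None


partitioned = {("2023-07-01", "id-1"): {"name": "a"}}
global_idx = {"id-1": {"partition": "2023-07-01", "name": "a"}}

# A delete event whose `before` carries default values yields a wrong
# partition path (an empty string here), so the partitioned lookup misses
# and the record stays in place.
missed = delete_partitioned(partitioned, "", "id-1")

# A lookup by record key alone does not depend on the partition path.
found = delete_global(global_idx, "id-1")
```

   In this sketch the partitioned delete leaves the record untouched because 
the `(partition_path, record_key)` pair never matches, which is the behavior I 
believe we are hitting.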


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
