danny0405 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1426629317

   > I noticed that for each new record I append I get a parquet file. So the first parquet file has the first record; when I insert a new row, a second parquet file is created with both records; and when I insert a third time, a third parquet file is created with all 3 rows. When I update any of them, I get a log file containing the update. After a number of appends, the parquet files are compacted into one parquet file: the newest parquet file (which has the three appended records) is kept, while the other two parquet files are removed.
   
   This is actually how the `BLOOM_FILTER` index works: all the inserts are 
written into a new FileSlice, and only delta updates are written into logs (because, 
for UPDATEs, Hudi needs to know where the old records are located). 
There is also a small file/FileSlice strategy at play, which makes things a bit 
more complex; that is why, as you have perceived, new records are written into the 
same file group.
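   
   A minimal sketch of the write pattern described above, assuming a SparkSession `spark` with the Hudi bundle on the classpath; the path, table name, and the `id`/`ts` field names are hypothetical:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import spark.implicits._
   
   val basePath = "file:///tmp/hudi_mor_demo"   // hypothetical path
   
   val hudiOptions = Map(
     "hoodie.table.name"                        -> "hudi_mor_demo",
     "hoodie.datasource.write.table.type"       -> "MERGE_ON_READ",
     "hoodie.datasource.write.operation"        -> "upsert",
     "hoodie.datasource.write.recordkey.field"  -> "id",    // assumed key field
     "hoodie.datasource.write.precombine.field" -> "ts",    // assumed ordering field
     "hoodie.index.type"                        -> "BLOOM"  // the index discussed above
   )
   
   // New keys are written as a new base parquet file (a new FileSlice, possibly in the
   // same file group because of small-file handling), while an upsert of an existing
   // key lands in a .log file until compaction rewrites the base file.
   val inserts = Seq((1, "a", 1L), (2, "b", 1L)).toDF("id", "name", "ts")
   inserts.write.format("hudi").options(hudiOptions).mode(SaveMode.Append).save(basePath)
   
   val update = Seq((1, "a-updated", 2L)).toDF("id", "name", "ts")
   update.write.format("hudi").options(hudiOptions).mode(SaveMode.Append).save(basePath)
   ```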
   
   The rt (real-time/snapshot) view merges all the base parquet files with the delta 
logs, so the query result is always correct.
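   
   For illustration, a sketch of the two MOR read views (same assumed `basePath` as above): the snapshot ("rt") view merges base files with their delta logs on the fly, while the read-optimized view reads only the base parquet files:
   
   ```scala
   val rtView = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "snapshot")
     .load(basePath)
   
   val roView = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "read_optimized")
     .load(basePath)
   
   rtView.select("id", "name", "ts").show()   // reflects the update sitting in the log file
   roView.select("id", "name", "ts").show()   // shows only base-file data until compaction
   ```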

