bithw1 commented on issue #11950:
URL: https://github.com/apache/hudi/issues/11950#issuecomment-2357482621

   > First of all, could you tell me which version of hudi you are using?
   > 
   > > As known, bulk insert operation doesn't build index or do index loop 
during inserting.
   > 
   > In fact, this is not accurate. When writing a parquet file in Hudi, a 
Bloom filter based on the primary key is constructed and serialized into the 
footer.
   
   Thanks @BruceKellan for the helpful answer. I am using Hudi 0.9.0 with Spark 3.0.0.
   
   
   It looks like I have misunderstood this. I have drawn the following two conclusions; please correct me if I am wrong. Thanks, @BruceKellan
   
   1. Per your answer, during bulk insert an index (a Bloom filter built from the record keys) is indeed created and saved in the parquet footer, and that index will take effect for later upsert operations.
   2. I tried another experiment: bulk inserting records that have duplicate record keys. I found that all of the duplicate records were written into the table, so it looks like during bulk insert the index is created but is not used for de-duplication via record-key lookup.
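   To make the distinction in the two conclusions above concrete, here is a minimal, hypothetical sketch (plain Python, not Hudi's actual code): each written file carries a Bloom filter over its record keys, `bulk_insert` builds that filter as a by-product but never consults it, and `upsert` uses it to prune files before checking keys. The `BloomFilter`, `bulk_insert`, and `upsert` names are illustrative only.

   ```python
   import hashlib

   class BloomFilter:
       """Tiny Bloom filter over record keys (illustration only)."""
       def __init__(self, size=1024, hashes=3):
           self.size, self.hashes, self.bits = size, hashes, 0

       def _positions(self, key):
           for i in range(self.hashes):
               h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
               yield int(h, 16) % self.size

       def add(self, key):
           for p in self._positions(key):
               self.bits |= 1 << p

       def might_contain(self, key):
           return all(self.bits >> p & 1 for p in self._positions(key))

   def bulk_insert(files, records):
       """Writes every record to a new file; builds the Bloom filter
       (stored alongside the rows, like a parquet-footer index) but
       never consults it, so duplicate keys are all kept."""
       bf, rows = BloomFilter(), []
       for key, value in records:
           bf.add(key)                # index is written out...
           rows.append((key, value))  # ...but no lookup happens
       files.append({"bloom": bf, "rows": rows})

   def upsert(files, records):
       """Consults each file's Bloom filter to skip files that cannot
       contain the key, then updates matching rows instead of
       appending duplicates; inserts only if no file has the key."""
       for key, value in records:
           updated = False
           for f in files:
               if f["bloom"].might_contain(key):  # index prunes files
                   for i, (k, _) in enumerate(f["rows"]):
                       if k == key:
                           f["rows"][i] = (key, value)
                           updated = True
           if not updated:
               bulk_insert(files, [(key, value)])

   files = []
   bulk_insert(files, [("k1", "a"), ("k1", "b")])  # duplicate key kept
   total = sum(len(f["rows"]) for f in files)      # 2 rows: no dedup
   upsert(files, [("k1", "c")])                    # index lookup, in-place update
   ```

   The sketch reproduces the observed behavior: bulk insert writes both `k1` rows, while upsert uses the filter for a record-key lookup and updates rather than duplicates.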
   
   

