[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

GitBox Sun, 03 Jan 2021 20:10:08 -0800


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-753746016



   @so-lazy : I am looping in @bvaradar to help you out here. In the meantime,
here is some context on GLOBAL_BLOOM. Hudi has two kinds of indexes, regular
and global. With a regular index (BLOOM, for example), record keys are unique
within a partition, but the same record key can exist in different partitions.
Within a partition, Hudi updates records based on their record keys and serves
you only the latest snapshot for each record key of interest.
   With the global versions, on the other hand, record keys are unique across
the entire dataset. In other words, the same record key cannot exist in
different partitions. So if you insert a record, rec_1, into partition1 and
later try to insert the same record (rec_1) into a different partition, say
partition2, Hudi by default will update the record in partition1. There is a
config you can set, however, in which case Hudi will delete rec_1 from
partition1 and insert it into partition2.
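
The two behaviours described above can be sketched with a toy in-memory model. This is not Hudi code: the `Table` type, the function names, and the `update_partition_path` flag are hypothetical names chosen for illustration, and a plain dict stands in for the table.

```python
from typing import Dict, Tuple

# Toy table: (partition, record_key) -> value. Hypothetical, not a Hudi type.
Table = Dict[Tuple[str, str], str]


def upsert_regular(table: Table, partition: str, key: str, value: str) -> None:
    """Regular index: the key is looked up only in the incoming partition."""
    table[(partition, key)] = value


def upsert_global(table: Table, partition: str, key: str, value: str,
                  update_partition_path: bool = False) -> None:
    """Global index: the key is looked up across every partition."""
    existing = [p for (p, k) in table if k == key]
    if not existing:
        # Key not seen anywhere yet: plain insert.
        table[(partition, key)] = value
        return
    old_partition = existing[0]
    if old_partition == partition or not update_partition_path:
        # Default described above: update the record where it already lives.
        table[(old_partition, key)] = value
    else:
        # With the config set: delete from the old partition, insert into the new.
        del table[(old_partition, key)]
        table[(partition, key)] = value


regular: Table = {}
upsert_regular(regular, "partition1", "rec_1", "v1")
upsert_regular(regular, "partition2", "rec_1", "v2")  # second copy of rec_1

global_default: Table = {}
upsert_global(global_default, "partition1", "rec_1", "v1")
upsert_global(global_default, "partition2", "rec_1", "v2")  # updates partition1

global_move: Table = {}
upsert_global(global_move, "partition1", "rec_1", "v1")
upsert_global(global_move, "partition2", "rec_1", "v2",
              update_partition_path=True)  # moves rec_1 to partition2
```

In this model the regular index ends up with rec_1 in both partitions, while the global index keeps exactly one copy, either in partition1 (default) or in partition2 (when the flag is set).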
   This is the major difference between the regular and global versions of the
index. Because a global index has to look up every incoming record against all
partitions, it is known to be less performant than a regular index. So unless
you need record keys to be unique across partitions, I would suggest using a
regular index (BLOOM, for example).
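
As a rough illustration of how that choice might be expressed as write options, here is a small helper that builds an options dict. The helper name and its parameters are hypothetical. The option keys `hoodie.index.type` and `hoodie.bloom.index.update.partition.path` are the ones I understand Hudi to use for this, but please verify them, and their defaults, against the configuration reference for your Hudi version.

```python
from typing import Dict


def hudi_index_options(global_uniqueness: bool,
                       move_on_partition_change: bool = False) -> Dict[str, str]:
    """Build index-related write options (hypothetical helper, not a Hudi API)."""
    if not global_uniqueness:
        # Regular index: keys are unique per partition, lookups stay local.
        return {"hoodie.index.type": "BLOOM"}
    return {
        # Global index: keys are unique across the whole table.
        "hoodie.index.type": "GLOBAL_BLOOM",
        # Assumed key for the "delete from old partition, insert into new" switch.
        "hoodie.bloom.index.update.partition.path":
            str(move_on_partition_change).lower(),
    }


regular_opts = hudi_index_options(global_uniqueness=False)
global_opts = hudi_index_options(global_uniqueness=True,
                                 move_on_partition_change=True)
```

A dict like this would typically be merged with the rest of the table's write options and passed to the Spark writer, for example through `.options(**opts)` in PySpark.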
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
