xushiyan opened a new pull request, #8344:
URL: https://github.com/apache/hudi/pull/8344

   ### Change Logs
   
   When a global index (bloom or simple) is used and update partition is set 
to true, duplicates can be introduced. Suppose a record starts in p1 and is 
later updated to p2. If it is then updated to p3 before compaction has 
happened, the global index joins the incoming record against both old versions 
(in p1 and p2) and tags 2 records for insert into p3. These duplicates stay in 
the dataset and are not reconciled unless the table is deduplicated manually.
   
   This patch ensures that dedup happens within the indexing (tagging) phase.
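
   The scenario above can be illustrated with a minimal, self-contained sketch. 
It is not the code in this patch and uses no Hudi classes: `Tagged`, `tag`, 
`dedup` and `countInserts` are hypothetical names, and the tagging logic is a 
simplified assumption of how a global index emits a delete for each old 
partition plus an insert into the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GlobalIndexDedupSketch {

    /** Hypothetical stand-in for a tagged record; not a Hudi class. */
    static final class Tagged {
        final String recordKey;
        final String partitionPath;
        final boolean isDelete;

        Tagged(String recordKey, String partitionPath, boolean isDelete) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
            this.isDelete = isDelete;
        }
    }

    /**
     * Simplified tagging (assumption): for every existing location of the key,
     * emit a delete for the old partition and an insert into the new partition.
     * With two stale locations this yields two inserts into the new partition.
     */
    static List<Tagged> tag(String recordKey, String newPartition, List<String> existingPartitions) {
        List<Tagged> out = new ArrayList<>();
        if (existingPartitions.isEmpty()) {
            out.add(new Tagged(recordKey, newPartition, false));
        }
        for (String oldPartition : existingPartitions) {
            if (!oldPartition.equals(newPartition)) {
                out.add(new Tagged(recordKey, oldPartition, true));
            }
            out.add(new Tagged(recordKey, newPartition, false));
        }
        return out;
    }

    /** Keep the first tagged record per (record key, partition, delete flag). */
    static List<Tagged> dedup(List<Tagged> tagged) {
        Map<List<Object>, Tagged> unique = new LinkedHashMap<>();
        for (Tagged t : tagged) {
            List<Object> id = Arrays.<Object>asList(t.recordKey, t.partitionPath, t.isDelete);
            unique.putIfAbsent(id, t);
        }
        return new ArrayList<>(unique.values());
    }

    /** Count non-delete records tagged for the given partition. */
    static long countInserts(List<Tagged> tagged, String partition) {
        return tagged.stream()
                .filter(t -> !t.isDelete && t.partitionPath.equals(partition))
                .count();
    }

    public static void main(String[] args) {
        // Record k1 still has stale versions in p1 and p2 and is now updated to p3.
        List<Tagged> tagged = tag("k1", "p3", Arrays.asList("p1", "p2"));
        System.out.println("inserts into p3 before dedup: " + countInserts(tagged, "p3"));
        System.out.println("inserts into p3 after dedup: " + countInserts(dedup(tagged), "p3"));
    }
}
```

   In this sketch the deletes for p1 and p2 are both kept, so the stale versions 
are still removed, while only one insert into p3 survives.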
   
   ### Impact
   
   The global index gains an extra dedup step for some records, which may slow 
down the whole process if a lot of partition updates happen. In most scenarios 
such updates are rare and the perf impact is negligible.
   
   ### Risk level (write none, low medium or high below)
   
   Medium
   
   ### Documentation Update
   
   - [ ] New config `hoodie.global.index.dedup.parallelism`
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

