xushiyan opened a new pull request, #8344: URL: https://github.com/apache/hudi/pull/8344
### Change Logs

When a global index (bloom or simple) is used and update partition is set to true, duplicate records can be written. Suppose a record starts in partition p1 and is later updated to p2. If it is then updated to p3 before compaction has happened, the global index joins against both old versions of the record (in p1 and p2) and tags 2 records for insert into p3. These duplicates stay in the dataset and are not reconciled unless the table is deduplicated manually. This patch ensures that dedup happens within the indexing (tagging) phase.

### Impact

The global index gains an extra dedup step for some records, which may slow down the whole process if many partition updates happen. In most scenarios partition updates are rare and the perf impact is negligible.

### Risk level (write none, low medium or high below)

Medium

### Documentation Update

- [ ] New config `hoodie.global.index.dedup.parallelism`

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
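
### Illustrative sketch

To make the scenario in the Change Logs concrete, here is a minimal, self-contained sketch of what "dedup within the tagging phase" could look like. It is not the patch's actual code and does not use Hudi classes: `GlobalIndexDedupSketch`, `Tagged`, and `dedupByKey` are hypothetical names, and keeping the first match per record key is an assumed policy chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GlobalIndexDedupSketch {

    /** Hypothetical stand-in for an incoming record after the global index join. */
    public static final class Tagged {
        public final String recordKey;
        public final String matchedPartition; // partition where an old version was found
        public final String targetPartition;  // partition the incoming record is written to

        public Tagged(String recordKey, String matchedPartition, String targetPartition) {
            this.recordKey = recordKey;
            this.matchedPartition = matchedPartition;
            this.targetPartition = targetPartition;
        }
    }

    /**
     * Keeps one tagged record per record key (the first one seen),
     * preserving the input order. Assumed policy, for illustration only.
     */
    public static List<Tagged> dedupByKey(List<Tagged> tagged) {
        Map<String, Tagged> firstPerKey = new LinkedHashMap<>();
        for (Tagged t : tagged) {
            firstPerKey.putIfAbsent(t.recordKey, t);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        // Key k1 has old versions in both p1 and p2, so the join yields two
        // records that would both be inserted into p3 without a dedup step.
        List<Tagged> tagged = Arrays.asList(
            new Tagged("k1", "p1", "p3"),
            new Tagged("k1", "p2", "p3"),
            new Tagged("k2", "p1", "p1"));

        List<Tagged> deduped = dedupByKey(tagged);
        System.out.println(deduped.size()); // prints 2
    }
}
```

The sketch only covers the one-record-per-key step. How the old versions in p1 and p2 are handled, and how the dedup is parallelized (see `hoodie.global.index.dedup.parallelism`), is left to the patch itself.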