wangyum opened a new issue, #15336: URL: https://github.com/apache/iceberg/issues/15336
### Proposed Change

### Problem

In Flink CDC workloads, the same row keys are often deleted multiple times within a single checkpoint:

- Each UPDATE generates DELETE + INSERT in the CDC stream
- Flink processes records one by one, with no opportunity for batch deduplication
- This results in 5-10x more delete records than necessary
- It creates excessive small delete files, causing NameNode RPC pressure

### Solution

Add checkpoint-scoped deduplication caches to skip redundant delete operations:

- `pendingDeleteKeys`: deduplicates `deleteKey()` calls (key-only deletes)
- `pendingDeleteRows`: deduplicates `delete()` calls (full-row deletes)
- Caches are cleared when the writer closes (memory is bounded by checkpoint size)
- A check-before-copy optimization minimizes object allocations

A minimal sketch of this behavior appears after the checklist below.

### Proposal document

_No response_

### Specifications

- [ ] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
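To make the Solution concrete, here is a minimal Java sketch of the proposed caching behavior. The class name, method signatures, and the `copier` parameter are hypothetical and are not the actual Iceberg/Flink writer API; only the two cache names (`pendingDeleteKeys`, `pendingDeleteRows`) come from the proposal above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Illustrative sketch only; names and signatures are hypothetical and do not
// reflect the real Iceberg writer API.
class CheckpointScopedDeleteDedup<K, R> {

  // Deduplicates deleteKey() calls (key-only deletes) within one checkpoint.
  private final Set<K> pendingDeleteKeys = new HashSet<>();

  // Deduplicates delete() calls (full-row deletes) within one checkpoint.
  private final Set<R> pendingDeleteRows = new HashSet<>();

  /** Returns true if this key-only delete is new and should be written. */
  boolean deleteKey(K key) {
    // Set.add returns false for duplicates, so redundant deletes are skipped.
    return pendingDeleteKeys.add(key);
  }

  /**
   * Returns true if this full-row delete is new and should be written.
   * Check-before-copy: Flink often reuses row objects, so a defensive copy is
   * made only after the row is known to be new, avoiding per-record allocation.
   */
  boolean delete(R row, UnaryOperator<R> copier) {
    if (pendingDeleteRows.contains(row)) {
      return false; // redundant delete within this checkpoint
    }
    pendingDeleteRows.add(copier.apply(row));
    return true;
  }

  /** Clear caches when the writer closes; memory is bounded by checkpoint size. */
  void close() {
    pendingDeleteKeys.clear();
    pendingDeleteRows.clear();
  }

  public static void main(String[] args) {
    CheckpointScopedDeleteDedup<String, String> dedup = new CheckpointScopedDeleteDedup<>();
    // A CDC UPDATE arrives as DELETE + INSERT, so the same key may be deleted
    // several times within one checkpoint.
    System.out.println(dedup.deleteKey("user-42")); // true  -> write delete record
    System.out.println(dedup.deleteKey("user-42")); // false -> skip redundant delete
    dedup.close(); // next checkpoint starts with empty caches
    System.out.println(dedup.deleteKey("user-42")); // true again
  }
}
```

Because the sets live only for the duration of one checkpoint and are cleared on close, memory use tracks the number of distinct keys deleted per checkpoint rather than the total stream size.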
