wangyum opened a new issue, #15336:
URL: https://github.com/apache/iceberg/issues/15336

   ### Proposed Change
   
   ### Problem
In Flink CDC workloads, the same row keys are often deleted multiple times within a single checkpoint:
   - Each UPDATE generates a DELETE + INSERT pair in the CDC stream, so a key updated N times in one checkpoint is deleted N times
   - Flink processes records one by one, with no opportunity for batch deduplication
   - This results in 5-10x more delete records than necessary
   - The redundant deletes create excessive small delete files, causing NameNode RPC pressure
   
   ### Solution
   Add checkpoint-scoped deduplication caches that skip redundant delete operations:
   - `pendingDeleteKeys`: deduplicates `deleteKey()` calls (key-only deletes)
   - `pendingDeleteRows`: deduplicates `delete()` calls (full-row deletes)
   - Both caches are cleared when the writer closes, so memory is bounded by the size of one checkpoint
   - A check-before-copy optimization avoids allocating a defensive copy for duplicate deletes (see the sketch below)
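   
   Below is a minimal sketch of the intended mechanism. The `DedupDeltaWriter` class, its `DeltaWriter` delegate interface, and the copier parameters are hypothetical stand-ins for illustration; the actual change would live inside Iceberg's Flink equality-delta writer and would use `StructLike`-aware sets keyed on the table's equality fields.
   
```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

/**
 * Illustrative sketch only: a delta writer wrapper that skips duplicate
 * delete operations within one checkpoint. All names are hypothetical.
 */
class DedupDeltaWriter<K, R> {

  /** Hypothetical stand-in for the underlying writer being wrapped. */
  interface DeltaWriter<K, R> {
    void deleteKey(K key);
    void delete(R row);
    void close();
  }

  // Keys already deleted via deleteKey() in the current checkpoint.
  private final Set<K> pendingDeleteKeys = new HashSet<>();
  // Full rows already deleted via delete() in the current checkpoint.
  private final Set<R> pendingDeleteRows = new HashSet<>();

  private final DeltaWriter<K, R> delegate;
  // Flink reuses record objects, so anything cached must be defensively copied.
  private final UnaryOperator<K> keyCopier;
  private final UnaryOperator<R> rowCopier;

  DedupDeltaWriter(DeltaWriter<K, R> delegate,
                   UnaryOperator<K> keyCopier,
                   UnaryOperator<R> rowCopier) {
    this.delegate = delegate;
    this.keyCopier = keyCopier;
    this.rowCopier = rowCopier;
  }

  /** Key-only delete; skipped if this key was already deleted this checkpoint. */
  void deleteKey(K key) {
    // Check-before-copy: only allocate a copy when the key is actually new,
    // so a duplicate delete costs one set lookup and nothing more.
    if (!pendingDeleteKeys.contains(key)) {
      pendingDeleteKeys.add(keyCopier.apply(key));
      delegate.deleteKey(key);
    }
  }

  /** Full-row delete; skipped if this exact row was already deleted. */
  void delete(R row) {
    if (!pendingDeleteRows.contains(row)) {
      pendingDeleteRows.add(rowCopier.apply(row));
      delegate.delete(row);
    }
  }

  /** Clearing on close bounds memory to a single checkpoint's key set. */
  void close() {
    pendingDeleteKeys.clear();
    pendingDeleteRows.clear();
    delegate.close();
  }
}
```
   
   The `contains`-then-copy ordering is the point of the check-before-copy bullet: in CDC workloads the duplicate path is the hot path, and skipping the defensive copy there keeps deduplication nearly allocation-free.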
   
   ### Proposal document
   
   _No response_
   
   ### Specifications
   
   - [ ] Table
   - [ ] View
   - [ ] REST
   - [ ] Puffin
   - [ ] Encryption
   - [ ] Other

