t3hw commented on PR #15651:
URL: https://github.com/apache/iceberg/pull/15651#issuecomment-4101467127

   I'd like to share some context on how this fix was developed and what 
alternatives were considered.
   
   **Disclaimer:** I used Claude extensively for the investigation, design, and 
implementation of this fix. I validated every conclusion against the code, but 
want to be transparent about the tooling.
   
   ---
   
   ### Approaches rejected
   
   After @koodin9's comment identified the data loss with 
`commitBuffer.clear()`, I evaluated several alternatives before settling on 
per-commitId group separation:
   
   | Approach | Why rejected |
   |----------|-------------|
   | Conditional clear (boolean flag) | Too coarse — successful partial → 
failed commit → clear discards legitimate late events. Same-sequence problem 
persists. |
   | Two-bucket stale/current split | Consecutive failures lump stale events 
into one RowDelta — equality deletes within the bucket share the same seq 
number. |
   | Iceberg Transaction API | No per-RowDelta offset validator 
(#14509/#14510), uncertain `txn.table()` pending state, all-or-nothing retry. |
   | Data-then-deletes split | Equality deletes are key-based, not cycle-aware 
— cycle 1's deletes would remove cycle 2's data too. |
   | Position delete injection | `BaseRowDelta` validation prevents referencing 
files added in the same RowDelta. |
   
   ---
   
   ### Future enhancement: Worker-side rollforward
   
   The current fix handles stale events at the Coordinator. A cleaner long-term 
approach would eliminate them at the source.
   
   Today, Workers fire-and-forget `DataWritten` — they don't know if it was 
committed. If Workers consumed `CommitComplete` events (currently ignored in 
`Worker.receive()`) and tracked pending writes, they could detect uncommitted 
work when the next `StartCommit` arrives. On detection, the Worker re-seeks to 
the uncommitted source offsets and re-processes under the new commitId — 
producing a fresh `DataWritten` instead of leaving a stale one.
   
   This eliminates stale events entirely, removing the need for per-group 
ordering, retry escalation, and TTL eviction. It's deferred because it touches 
Worker, Channel, SinkWriter, and the control topic protocol — a scope that 
warrants its
   own design discussion.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to