t3hw commented on PR #15651: URL: https://github.com/apache/iceberg/pull/15651#issuecomment-4101467127
I'd like to share some context on how this fix was developed and what alternatives were considered. **Disclaimer:** I used Claude extensively for the investigation, design, and implementation of this fix. I validated every conclusion against the code, but want to be transparent about the tooling. --- ### Approaches rejected After @koodin9's comment identified the data loss with `commitBuffer.clear()`, I evaluated several alternatives before settling on per-commitId group separation: | Approach | Why rejected | |----------|-------------| | Conditional clear (boolean flag) | Too coarse — successful partial → failed commit → clear discards legitimate late events. Same-sequence problem persists. | | Two-bucket stale/current split | Consecutive failures lump stale events into one RowDelta — equality deletes within the bucket share the same seq number. | | Iceberg Transaction API | No per-RowDelta offset validator (#14509/#14510), uncertain `txn.table()` pending state, all-or-nothing retry. | | Data-then-deletes split | Equality deletes are key-based, not cycle-aware — cycle 1's deletes would remove cycle 2's data too. | | Position delete injection | `BaseRowDelta` validation prevents referencing files added in the same RowDelta. | --- ### Future enhancement: Worker-side rollforward The current fix handles stale events at the Coordinator. A cleaner long-term approach would eliminate them at the source. Today, Workers fire-and-forget `DataWritten` — they don't know if it was committed. If Workers consumed `CommitComplete` events (currently ignored in `Worker.receive()`) and tracked pending writes, they could detect uncommitted work when the next `StartCommit` arrives. On detection, the Worker re-seeks to the uncommitted source offsets and re-processes under the new commitId — producing a fresh `DataWritten` instead of leaving a stale one. This eliminates stale events entirely, removing the need for per-group ordering, retry escalation, and TTL eviction. It's deferred because it touches Worker, Channel, SinkWriter, and the control topic protocol — a scope that warrants its own design discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
