t3hw commented on PR #15710: URL: https://github.com/apache/iceberg/pull/15710#issuecomment-4106491280
Thanks for the PR, @koodin9, and for the feedback on #15651 that identified the data loss scenario. That prompted a full redesign on my end. Your core approach here is exactly right, and I've built additional hardening on top of the same foundation. Would love to collaborate. Here's what I've added:

- **Per-table group commits**: tables commit in parallel, commitId groups commit sequentially per table, so a stale failure only blocks its own table
- **Selective buffer draining**: only successfully committed envelopes are removed; failed groups retry next cycle
- **Error escalation**: configurable blocking retries → failure policy (fail/non-blocking) → TTL eviction with orphaned file path logging
- **Per-group offsets**: stale groups write their own envelope offsets (no null guards needed), preventing offset poisoning
- **Partial offset advancement**: on partial success, consumer offsets advance to the min uncommitted offset
- **JMX monitoring**: `CommitStateMXBean` for stale group count, buffer size, eviction metrics
- **8 new tests** covering group ordering, selective removal, and failure scenarios

Happy to push commits on your branch or open a stacked PR, whatever works best.
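
For readers following the thread, the selective draining and partial offset advancement items could look roughly like the sketch below. This is only an illustration of the idea as described in the list, not code from either PR: `GroupCommitSketch`, `Envelope`, `drainCommitted`, and `safeOffsets` are all hypothetical names, and it assumes that a group is identified by table plus commitId and that a committed consumer offset means "next offset to read".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Iceberg code base.
final class GroupCommitSketch {

  // One buffered event: the (table, commitId) group it belongs to and its source position.
  static final class Envelope {
    final String table;
    final String commitId;
    final int partition;
    final long offset;

    Envelope(String table, String commitId, int partition, long offset) {
      this.table = table;
      this.commitId = commitId;
      this.partition = partition;
      this.offset = offset;
    }

    // Assumed group identity: table plus commitId.
    String groupKey() {
      return table + "/" + commitId;
    }
  }

  // Selective draining: remove only envelopes whose group committed successfully.
  // Envelopes from failed groups stay in the buffer and are retried next cycle.
  static void drainCommitted(List<Envelope> buffer, Set<String> committedGroups) {
    buffer.removeIf(e -> committedGroups.contains(e.groupKey()));
  }

  // Partial offset advancement: per partition, return the offset that is safe to commit.
  // If any envelope on the partition is still uncommitted, that is the smallest
  // uncommitted offset; otherwise it is one past the largest committed offset.
  static Map<Integer, Long> safeOffsets(List<Envelope> buffer, Set<String> committedGroups) {
    Map<Integer, Long> nextAfterCommitted = new HashMap<>();
    Map<Integer, Long> minUncommitted = new HashMap<>();
    for (Envelope e : buffer) {
      if (committedGroups.contains(e.groupKey())) {
        nextAfterCommitted.merge(e.partition, e.offset + 1, Math::max);
      } else {
        minUncommitted.merge(e.partition, e.offset, Math::min);
      }
    }
    Map<Integer, Long> result = new HashMap<>(nextAfterCommitted);
    result.putAll(minUncommitted); // uncommitted data caps how far a partition may advance
    return result;
  }
}
```

In this sketch `safeOffsets` is meant to be called before `drainCommitted`, while the buffer still holds both committed and failed envelopes. Capping at the smallest uncommitted offset means some already-committed events may be re-read after a restart, which trades duplicates for not losing data.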
