Hi Andrew and team,

Congrats on the KIP passing. The design is really solid and much needed for the "Queues for Kafka" roadmap. I've been tied up, but I finally had a chance to look at the implementation path for share groups and wanted to flag a few "day 2" operational risks. In my experience with high-throughput pipelines, these are the edge cases that usually lead to 2 AM outages if the broker-side logic isn't tightened up before GA.
1. Coordinator Failover & Duplicates

The KIP acknowledges that DLQ writes and state-topic updates aren't atomic, meaning a coordinator failover (and the accompanying PID reset) will cause duplicates. For anyone in finance or other regulated industries, this breaks the 1:1 audit trail we rely on for compliance. This is a critical gap. We need a clear plan for deduplication during the coordinator recovery path.

2. Handling a Stuck ARCHIVING State

If the DLQ topic goes offline or hits a leader election, we can't let records sit in ARCHIVING indefinitely. Without a configurable errors.deadletterqueue.write.timeout.ms, records could stay stuck for the duration of a sustained outage, creating unbounded memory pressure. I'd suggest a fall-through to ARCHIVED with a logged error to keep the system alive if the DLQ is unreachable.

3. Bounded Retries on the Broker

The KIP mentions retrying on metadata/leadership issues but doesn't specify a limit. Without a cap, a total partition failure could trigger an indefinite retry loop, wasting broker I/O and CPU. I'd propose a new config, errors.deadletterqueue.write.retries, to provide a clean exit condition.

4. Circuit Breaker for Systemic Failures

This is the most critical point for me. If a downstream service dies, the share group will hit the delivery limit for every message, effectively draining the main topic into the DLQ within minutes. This destroys message order and makes re-processing a nightmare. I'd propose a threshold: if more than 20% of messages hit the DLQ within a rolling window, the group should PAUSE. It's always safer to stop the group than to dump the whole topic.

5. Mandatory Disposition Headers

Since the broker already knows whether a record failed due to MAX_DELIVERY_ATTEMPTS_REACHED or an explicit CLIENT_REJECTED NACK, we should make that a mandatory _dlq.errors.disposition header. Without it, operators can't distinguish a poison pill from a systemic timeout without digging through broker logs.

6. DLQ Ownership Check

We should add a check at the coordinator level to ensure a DLQ topic isn't shared by multiple groups. Cross-contamination makes the DLQ useless for debugging if you're seeing failures from unrelated applications in the same stream.

I'm particularly interested in your thoughts on the circuit breaker and the write timeouts, as those look like the biggest stability risks at scale. Happy to help spec either of them out if the team finds them worthwhile.

Best regards,

Vaquar Khan
LinkedIn - https://www.linkedin.com/in/vaquar-khan-b695577/
Book - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
GitBook - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
Stack Overflow - https://stackoverflow.com/users/4812170/vaquar-khan
GitHub - https://github.com/vaquarkhan
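
P.S. To make the circuit-breaker proposal (point 4) concrete, here is a minimal sketch of the kind of rolling-window check I have in mind. All class, method, and parameter names are hypothetical; nothing here exists in Kafka today, and the real broker code path would look different.

```java
import java.time.Duration;
import java.util.ArrayDeque;

/**
 * Sketch of the proposed DLQ circuit breaker (hypothetical names).
 * Tracks per-record delivery outcomes in a rolling time window and
 * trips when the fraction routed to the DLQ exceeds a threshold.
 */
public class DlqCircuitBreaker {
    private record Outcome(long timestampMs, boolean sentToDlq) {}

    private final ArrayDeque<Outcome> window = new ArrayDeque<>();
    private final long windowMs;
    private final double tripRatio;   // e.g. 0.20 for the 20% threshold
    private final int minSamples;     // avoid tripping on tiny sample sizes

    public DlqCircuitBreaker(Duration windowDuration, double tripRatio, int minSamples) {
        this.windowMs = windowDuration.toMillis();
        this.tripRatio = tripRatio;
        this.minSamples = minSamples;
    }

    /** Record one delivery outcome; returns true if the group should PAUSE. */
    public synchronized boolean record(long nowMs, boolean sentToDlq) {
        window.addLast(new Outcome(nowMs, sentToDlq));
        // Evict outcomes that have aged out of the rolling window.
        while (!window.isEmpty() && nowMs - window.peekFirst().timestampMs() > windowMs) {
            window.removeFirst();
        }
        if (window.size() < minSamples) {
            return false;   // not enough data to judge a systemic failure
        }
        long dlqCount = window.stream().filter(Outcome::sentToDlq).count();
        return (double) dlqCount / window.size() > tripRatio;
    }
}
```

The minSamples floor matters: without it, the very first DLQ'd record would trip the breaker, pausing a healthy group on a single poison pill.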

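P.P.S. Likewise, a minimal sketch of the write-timeout fall-through from point 2. Again, every name here is hypothetical and the real broker internals would differ; the point is only that a DLQ write that exceeds the proposed errors.deadletterqueue.write.timeout.ms is abandoned with a logged error instead of pinning the record in ARCHIVING.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of the proposed ARCHIVING timeout fall-through (hypothetical names). */
public class ArchivingTimeoutSketch {
    enum State { ARCHIVING, ARCHIVED }

    static State archiveWithTimeout(CompletableFuture<Void> dlqWrite, long timeoutMs) {
        try {
            dlqWrite.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // DLQ unreachable: log and fall through rather than hold the
            // record in ARCHIVING and accumulate unbounded memory pressure.
            System.err.println("DLQ write timed out after " + timeoutMs
                    + " ms; archiving record without a DLQ copy");
        } catch (Exception e) {
            System.err.println("DLQ write failed: " + e.getMessage());
        }
        // Either way, the record leaves ARCHIVING.
        return State.ARCHIVED;
    }
}
```

The trade-off is explicit: on a sustained DLQ outage we lose the DLQ copy (recoverable from the error log) instead of losing the broker to memory pressure.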