Hi Vaquar,
Thanks for your interest in KIP-1191.

1) As the KIP states, the DLQ writes are not entirely atomic. I do take the 
point
that this might not be adequate for highly regulated industries. However, it is
acceptable to state such a limitation in a KIP, and a follow-on KIP can
tighten up the semantics if the community feels the need. The
provision of a DLQ mechanism for share groups is a major enhancement
even with this proviso.

KIP-1289 is also going to be important for users who care deeply about
atomicity. That one is only in the early stages of discussion, but it will bring
transactional acknowledgement for share groups. I expect transactional
DLQ writes could build upon that KIP.

2) You are not correct about the unbounded memory pressure. Records being
archived are still considered in-flight, and the number of in-flight records per
partition is limited already. So, a DLQ write problem will throttle delivery
of additional records, which is inconvenient but not fatal.
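
To make the throttling behaviour concrete, here is a minimal illustrative
model (not broker code; the cap of 5 is an arbitrary stand-in for the real
per-partition in-flight limit): once the in-flight set is full, delivery of
new records simply stops, so memory stays bounded even if DLQ writes stall.

```python
from collections import deque

MAX_IN_FLIGHT = 5  # stand-in for the per-partition in-flight record limit

in_flight = deque()  # records awaiting acknowledgement or archiving

def try_deliver(record):
    """Deliver a record only if the in-flight cap has headroom."""
    if len(in_flight) >= MAX_IN_FLIGHT:
        return False  # delivery throttled; memory use stays bounded
    in_flight.append(record)
    return True

# Simulate a DLQ outage: records enter ARCHIVING but never complete,
# so they stay in flight and delivery halts at the cap.
delivered = [try_deliver(i) for i in range(8)]
assert delivered == [True] * 5 + [False] * 3
assert len(in_flight) == MAX_IN_FLIGHT
```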

3) This is interesting but of course it takes us back in the direction
of breaking the 1:1 audit trail requirement you mentioned in (1). If we give
up after a bounded number of retries, what then?

4) The circuit breaker idea is potentially interesting. KIP-1249 proposes
the ability to pause delivery, so that might be a helpful building block.
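
For discussion, a sketch of the rolling-window trip condition you described
(the window size and 20% threshold are your proposed numbers, not anything
in an accepted KIP):

```python
from collections import deque

class DlqCircuitBreaker:
    """Signal a pause when DLQ writes exceed a fraction of recent outcomes."""

    def __init__(self, window=100, threshold=0.2):
        self.window = deque(maxlen=window)  # rolling record of outcomes
        self.threshold = threshold

    def record(self, went_to_dlq: bool) -> bool:
        """Record one delivery outcome; return True if the group should pause."""
        self.window.append(went_to_dlq)
        dlq_rate = sum(self.window) / len(self.window)
        return dlq_rate > self.threshold

breaker = DlqCircuitBreaker(window=10, threshold=0.2)
tripped = False
for outcome in [False] * 7 + [True] * 3:  # 30% of the last 10 hit the DLQ
    tripped = breaker.record(outcome)
assert tripped  # 3/10 > 0.2, so the breaker signals a pause
```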

5) A disposition header is also interesting. We’ll think about this. Given
that KIP-1191 is aiming for AK 4.4, there’s time for another micro-KIP
without affecting the intended schedule.
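
To show what operators would gain, a sketch of triaging DLQ records by such
a header (the header name `_dlq.errors.disposition` and its values are taken
from your proposal in this thread, not from any accepted KIP):

```python
# Hypothetical: header name and disposition values come from the proposal
# in this thread, not from KIP-1191 itself.
def classify(headers: dict) -> str:
    """Split DLQ records into poison pills vs systemic failures."""
    disposition = headers.get("_dlq.errors.disposition", b"").decode()
    if disposition == "CLIENT_REJECTED":
        return "poison-pill"       # the application explicitly NACKed it
    if disposition == "MAX_DELIVERY_ATTEMPTS_REACHED":
        return "systemic-failure"  # repeated delivery attempts exhausted
    return "unknown"

assert classify({"_dlq.errors.disposition": b"CLIENT_REJECTED"}) == "poison-pill"
assert classify({}) == "unknown"
```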

6) I disagree. Different organisations will have different policies for DLQs.
Some might want a single DLQ for the entire cluster, while others might want
a separate DLQ for each share group. The flexibility is intentional and there is
no right answer.


If you’re interested in progressing (4), I encourage you to contribute a KIP. Be
aware that doing so implies that you will be able to marshal the resources to
get the KIP implemented to production quality, and there would be a significant
amount of testing required. The team working on KIP-932 spent the majority of
the time between AK 4.1 and 4.2 testing. We had automated soak tests running
for months and progressively fixed many defects. Contributing a spec is not
sufficient by itself.

Thanks,
Andrew


> On 8 Mar 2026, at 08:02, vaquar khan <[email protected]> wrote:
>
> Hi Andrew and team,
>
> Congrats on the KIP passing. The design is really solid and much needed for
> the "Queues for Kafka" roadmap. I've been tied up, but finally had a chance
> to look at the implementation path for share groups and wanted to flag a
> few "day 2" operational risks. In my experience with high-throughput
> pipelines, these are the edge cases that usually lead to 2 AM outages if
> the broker-side logic isn't tightened up before GA.
>
> 1. Coordinator Failover & Duplicates
> The KIP admits that DLQ writes and state topic updates aren't atomic,
> meaning a coordinator failover (and PID reset) will cause duplicates. For
> anyone in finance or regulated industries, this breaks the 1:1 audit trail
> we rely on for compliance. This is a critical gap. We need a clear plan for
> deduplication during the coordinator recovery path.
>
> 2. Handling a Stuck ARCHIVING State
> If the DLQ topic goes offline or hits a leader election, we can't let
> records sit in ARCHIVING indefinitely. Without a configurable
> errors.deadletterqueue.write.timeout.ms, records could stay stuck during a
> sustained outage, creating unbounded memory pressure. I'd suggest a
> fall-through to ARCHIVED with a logged error to keep the system alive if
> the DLQ is unreachable.
>
> 3. Bounded Retries on the Broker
> The KIP mentions retrying on metadata/leadership issues but doesn't specify
> a limit. I'd propose a new config — errors.deadletterqueue.write.retries to
> provide a clean exit condition. Without a cap, a total partition failure
> could trigger an indefinite retry loop, wasting broker I/O and CPU.
>
> 4. Circuit Breaker for Systemic Failures
> This is the most critical point for me. If a downstream service dies, the
> share group will hit the delivery limit for every message, effectively
> draining the main topic into the DLQ in minutes. This kills message order
> and makes re-processing a nightmare. I'd propose a threshold: if >20% of
> messages hit the DLQ in a rolling window, the group should PAUSE. It's
> always safer to stop the group than to dump the whole topic.
>
> 5. Mandatory Disposition Headers
> Since the broker already knows if a record failed due to
> MAX_DELIVERY_ATTEMPTS_REACHED vs. an explicit CLIENT_REJECTED NACK, we
> should make that a mandatory _dlq.errors.disposition header. Without it,
> operators can't distinguish a poison pill from a systemic timeout without
> digging through broker logs.
>
> 6. DLQ Ownership Check
> We should add a check at the coordinator level to ensure a DLQ topic isn't
> shared by multiple groups. Cross-contamination makes the DLQ useless for
> debugging if you're seeing failures from unrelated applications in the same
> stream.
>
> I'm particularly interested in your thoughts on the circuit breaker and the
> write timeouts, as those seem like the biggest stability risks at scale.
> Happy to help spec either of these out if the team finds them worthwhile.
>
> Best regards,
> Vaquar Khan
> LinkedIn - https://www.linkedin.com/in/vaquar-khan-b695577/
> Book - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> GitBook - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> Stack Overflow - https://stackoverflow.com/users/4812170/vaquar-khan
> GitHub - https://github.com/vaquarkhan
