Hi Andrew,

Thanks for the KIP. I have a question about broker configuration.

PY00: Would you consider mentioning that the update mode for
errors.deadletterqueue.topic.name.prefix and
errors.deadletterqueue.auto.create.topics.enable is cluster-wide?
Clarifying that these values must be consistent across the cluster (or
updated dynamically as a cluster default) would help prevent
inconsistent values among brokers.
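
For illustration, this is how I'd expect an administrator to set these as
a dynamic cluster-wide default with the Admin API. This is only a sketch:
the config names come from this KIP and are not in any released broker
yet, and I'm assuming they end up marked as cluster-wide dynamic configs.

    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetDlqClusterDefaults {
        public static void main(String[] args) throws Exception {
            try (Admin admin = Admin.create(
                    Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
                // An empty broker id addresses the cluster-wide default
                // resource, so every broker picks up the same value.
                ConfigResource cluster =
                    new ConfigResource(ConfigResource.Type.BROKER, "");
                admin.incrementalAlterConfigs(Map.of(cluster, List.of(
                    new AlterConfigOp(
                        new ConfigEntry("errors.deadletterqueue.topic.name.prefix",
                            "dlq."),
                        AlterConfigOp.OpType.SET),
                    new AlterConfigOp(
                        new ConfigEntry("errors.deadletterqueue.auto.create.topics.enable",
                            "true"),
                        AlterConfigOp.OpType.SET)))).all().get();
            }
        }
    }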

Thanks,
PoAn

> On Jan 8, 2026, at 6:18 PM, Andrew Schofield <[email protected]> wrote:
>
> Hi Shekhar,
> Thanks for your comment.
>
> If the leader of the DLQ topic-partition changes as we are trying to
> write to it, then the code will need to cope with this.
>
> If the leader of the share-partition changes, we do not need special
> processing. If the transition to ARCHIVED is affected by a share-partition
> leadership change, the new leader will be responsible for the state
> transition. For example, if a consumer has rejected a record, a leadership
> change will cause the rejection to fail, and the record will be delivered
> again. This new delivery attempt will be performed by the new leader, and
> if this delivery attempt results in a rejection, the new leader will be
> responsible for initiating the DLQ write.
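>
> To make the client side concrete, explicit rejection looks roughly like
> this (a sketch of the KIP-932 share consumer; the topic, group name and
> process() handler are made up, and I'm assuming explicit acknowledgement
> mode):
>
>     import java.time.Duration;
>     import java.util.List;
>     import java.util.Map;
>     import org.apache.kafka.clients.consumer.AcknowledgeType;
>     import org.apache.kafka.clients.consumer.ConsumerRecord;
>     import org.apache.kafka.clients.consumer.ConsumerRecords;
>     import org.apache.kafka.clients.consumer.KafkaShareConsumer;
>     import org.apache.kafka.common.serialization.StringDeserializer;
>
>     public class RejectingShareConsumer {
>         public static void main(String[] args) {
>             try (KafkaShareConsumer<String, String> consumer =
>                     new KafkaShareConsumer<>(
>                         Map.<String, Object>of(
>                             "bootstrap.servers", "localhost:9092",
>                             "group.id", "my-share-group",
>                             "share.acknowledgement.mode", "explicit"),
>                         new StringDeserializer(), new StringDeserializer())) {
>                 consumer.subscribe(List.of("orders"));
>                 while (true) {
>                     ConsumerRecords<String, String> records =
>                         consumer.poll(Duration.ofMillis(500));
>                     for (ConsumerRecord<String, String> record : records) {
>                         try {
>                             process(record); // hypothetical business logic
>                             consumer.acknowledge(record, AcknowledgeType.ACCEPT);
>                         } catch (Exception e) {
>                             // Undeliverable: REJECT archives the record rather
>                             // than releasing it for another delivery attempt.
>                             consumer.acknowledge(record, AcknowledgeType.REJECT);
>                         }
>                     }
>                     consumer.commitSync();
>                 }
>             }
>         }
>
>         private static void process(ConsumerRecord<String, String> record) {
>             // ...
>         }
>     }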
>
> Hope this makes sense,
> Andrew
>
> On 2026/01/03 15:02:31 Shekhar Prasad Rajak via dev wrote:
>> Hi,
>> If the leader changes during a DLQ write, or a share-partition leader
>> changes, the partition is marked FENCED and the in-memory cache state
>> is lost. I think we need to add those cases as well.
>> Ref:
>> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartitionManager.java#L857
>>
>> Regards,
>> Shekhar
>>
>> On Monday 29 December 2025 at 11:53:20 pm GMT+5:30, Andrew Schofield
>> <[email protected]> wrote:
>>
>> Hi Abhinav,
>> Thanks for your comments.
>>
>> AD01: Even if we were to allow the client to write to the DLQ topic, it
>> would not be sufficient for situations in which the problem is one that
>> the client cannot handle. So, my view is that it's preferable to use
>> the same mechanism for all DLQ topic writes, regardless of whether the
>> consumer initiated the process by rejecting a record or not.
>>
>> AD02: I have added a metric for counting failed DLQ topic produce
>> requests per group. The KIP does say that the broker logs an error when
>> it fails to produce to the DLQ topic.
>>
>> Thanks,
>> Andrew
>>
>> On 2025/12/16 10:38:39 Abhinav Dixit via dev wrote:
>>> Hi Andrew,
>>> Thanks for this KIP. I have a couple of questions -
>>>
>>> AD01: From an implementation perspective, why can't we create/write
>>> records to the DLQ topic from the client? Why do we want to do it from
>>> the broker? As far as I understand, archiving the record on the share
>>> partition and writing records to the DLQ are independent. As you've
>>> mentioned in the KIP, "It is possible in rare situations that more
>>> than one DLQ record could be written for a particular undeliverable
>>> record", won't we minimize these scenarios (by eliminating the
>>> dependency on the persister write state result) by writing records to
>>> the DLQ from the client?
>>>
>>> AD02: I agree with AM01 that we should emit a metric which can report
>>> the count of failures of writing records to the DLQ topic, which an
>>> application developer can monitor. If we are logging an error, maybe
>>> we should log the count of such failures periodically?
>>>
>>> Regards,
>>> Abhinav Dixit
>>>
>>> On Fri, Dec 12, 2025 at 3:08 AM Apoorv Mittal <[email protected]>
>>> wrote:
>>>
>>>> Hi Andrew,
>>>> Thanks for the much-needed enhancement for Share Groups. Some
>>>> questions:
>>>>
>>>> AM1: The KIP states that in case of some failure "the broker will log
>>>> an error". How will an application developer utilize this information
>>>> and know about any such occurrences? Should we emit a metric which
>>>> can report the count of such failures which an application developer
>>>> can monitor?
>>>>
>>>> AM2: Today records can go to the Archived state either when they have
>>>> exceeded the delivery limit or when explicitly rejected by the
>>>> client. I am expecting the records will be written to the DLQ topic
>>>> only in the former case, i.e. when they have exceeded the delivery
>>>> limit; that's what the KIP explains. If yes, then can't there be
>>>> failure handling in the client which, on serialization or other
>>>> issues, wants to reject the message explicitly to be placed on the
>>>> DLQ? Should we have a config which governs this behaviour, i.e. if
>>>> enabled then any explicitly rejected record from the client will also
>>>> go to the DLQ?
>>>>
>>>> AM3: I read your response on the thread related to the tricky part of
>>>> ACLs for DLQ topics and I have a question in a similar area. The KIP
>>>> defines a config "errors.deadletterqueue.auto.create.topics.enable"
>>>> which, if enabled, lets the broker create the topic automatically
>>>> using the other DLQ topic params. If a new DLQ topic is created, then
>>>> what basic permissions should be applied so the application developer
>>>> can access it? Should we provide the capability to create DLQ topics
>>>> automatically, or should we restrict that and enforce that the topic
>>>> is created by the application owner? With the latter, we know the
>>>> application owner already has access to the DLQ topic.
>>>>
>>>> AM4: For "errors.deadletterqueue.topic.name.prefix", I am expecting
>>>> that this applies well for auto-created DLQ topics. But how do we
>>>> enforce the prefix behaviour when the application developer provides
>>>> the DLQ topic name in the group configuration? Will there be a check
>>>> while setting the group configuration
>>>> "errors.deadletterqueue.topic.name" against the broker's expected
>>>> prefix?
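>>>>
>>>> For concreteness, this is how I imagine that group configuration
>>>> being set (a sketch with the Admin API; the group name is made up,
>>>> and the config name comes from the KIP):
>>>>
>>>>     import java.util.List;
>>>>     import java.util.Map;
>>>>     import org.apache.kafka.clients.admin.Admin;
>>>>     import org.apache.kafka.clients.admin.AlterConfigOp;
>>>>     import org.apache.kafka.clients.admin.ConfigEntry;
>>>>     import org.apache.kafka.common.config.ConfigResource;
>>>>
>>>>     public class SetGroupDlqTopic {
>>>>         public static void main(String[] args) throws Exception {
>>>>             try (Admin admin = Admin.create(
>>>>                     Map.<String, Object>of("bootstrap.servers",
>>>>                         "localhost:9092"))) {
>>>>                 ConfigResource group = new ConfigResource(
>>>>                     ConfigResource.Type.GROUP, "my-share-group");
>>>>                 // Presumably the broker validates this value against
>>>>                 // errors.deadletterqueue.topic.name.prefix here.
>>>>                 admin.incrementalAlterConfigs(Map.of(group, List.of(
>>>>                     new AlterConfigOp(
>>>>                         new ConfigEntry("errors.deadletterqueue.topic.name",
>>>>                             "dlq.orders"),
>>>>                         AlterConfigOp.OpType.SET)))).all().get();
>>>>             }
>>>>         }
>>>>     }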
>>>>
>>>> Regards,
>>>> Apoorv Mittal
>>>>
>>>> On Wed, Dec 10, 2025 at 5:59 PM Federico Valeri <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Andrew, a few comments/questions from me:
>>>>>
>>>>> FV00: The KIP says "copying of the original record data into the DLQ
>>>>> is controlled by two configurations", but I only see the client-side
>>>>> configuration in the latest revision.
>>>>>
>>>>> FV01: The KIP says: "When an undeliverable record transitions to the
>>>>> Archived state for such a group, a record is written onto the DLQ
>>>>> topic". Later on it mentions a new "Archiving" state. Can you
>>>>> clarify the state transition when sending a record to a DLQ?
>>>>>
>>>>> FV02: Is the new state required to ensure that the DLQ record is
>>>>> eventually written in case of a Share Coordinator failover?
>>>>>
>>>>> Thanks,
>>>>> Fede
>>>>>
>>>>> On Tue, Dec 2, 2025 at 7:19 PM Andrew Schofield <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I'd like to bump this discussion thread for adding DLQs to share
>>>>>> groups.
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1191%3A+Dead-letter+queues+for+share+groups
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> On 2025/10/16 19:02:48 Andrew Schofield wrote:
>>>>>>> Hi Chia-Ping,
>>>>>>> Apologies for not responding to your comments. I was having email
>>>>>>> problems and I've only just noticed the unanswered comments. Also,
>>>>>>> this is not a direct reply.
>>>>>>>
>>>>>>>>> chia00: How can we specify the number of partitions and the
>>>>>>>>> replication factor when
>>>>>>>>> `errors.deadletterqueue.auto.create.topics.enable` is set to
>>>>>>>>> true?
>>>>>>>
>>>>>>> Personally, I prefer to make people create their DLQ topics
>>>>>>> manually, but I take the point. In order to give full flexibility,
>>>>>>> the list of configs you need is quite long, including min.isr and
>>>>>>> compression. For consistency with Kafka Connect sink connectors, I
>>>>>>> could add `errors.deadletterqueue.topic.replication.factor` but
>>>>>>> that's the only additional config provided by Kafka Connect. Is
>>>>>>> that worthwhile? I suggest not.
>>>>>>>
>>>>>>> The DLQ topic config in this KIP is broker-level config, while
>>>>>>> it's connector-level config for Kafka Connect. So, my preference
>>>>>>> is to just have one broker-level config for auto-creation on/off,
>>>>>>> and auto-create with the cluster's topic defaults. If anything
>>>>>>> more specific is required, the administrator can create the DLQ
>>>>>>> topic themselves with their preferences. Let me know what you
>>>>>>> think.
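>>>>>>>
>>>>>>> To illustrate manual creation, something like this gives the
>>>>>>> administrator full control (a sketch; the topic name, partition
>>>>>>> count, replication factor and min.insync.replicas value are all
>>>>>>> made-up example choices):
>>>>>>>
>>>>>>>     import java.util.List;
>>>>>>>     import java.util.Map;
>>>>>>>     import org.apache.kafka.clients.admin.Admin;
>>>>>>>     import org.apache.kafka.clients.admin.NewTopic;
>>>>>>>
>>>>>>>     public class CreateDlqTopic {
>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>             try (Admin admin = Admin.create(
>>>>>>>                     Map.<String, Object>of("bootstrap.servers",
>>>>>>>                         "localhost:9092"))) {
>>>>>>>                 // Name satisfies the default "dlq." prefix;
>>>>>>>                 // topic-level configs are whatever the
>>>>>>>                 // administrator prefers.
>>>>>>>                 NewTopic dlq = new NewTopic("dlq.orders", 6, (short) 3)
>>>>>>>                     .configs(Map.of("min.insync.replicas", "2"));
>>>>>>>                 admin.createTopics(List.of(dlq)).all().get();
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }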
>>>>>>>
>>>>>>>>> chia01: Should the error stack trace be included in the message
>>>>>>>>> headers, similar to what's done in KIP-298?
>>>>>>>
>>>>>>> In KIP-298, the code deciding to write a message to the DLQ is
>>>>>>> running in the Kafka Connect task and an exception is readily
>>>>>>> available. In this KIP, the code writing to the DLQ is running in
>>>>>>> the broker and it doesn't have any detail about why the record is
>>>>>>> being DLQed. I think that actually the __dlq.errors.exception.*
>>>>>>> headers are not feasible without allowing the application to
>>>>>>> provide additional error context. That might be helpful one day,
>>>>>>> but that's extending this KIP more than I intend. I have removed
>>>>>>> these headers from the KIP.
>>>>>>>
>>>>>>>>> chia02: Why does `errors.deadletterqueue.copy.record.enable`
>>>>>>>>> have different default values at the broker level and group
>>>>>>>>> level?
>>>>>>>
>>>>>>> I want the group administrator to be able to choose whether to
>>>>>>> copy the payloads. I was also thinking that it would be a good
>>>>>>> idea if the cluster administrator could prevent this across the
>>>>>>> cluster, but I've changed my mind and I've removed it.
>>>>>>>
>>>>>>> Maybe a better idea would simply be to have a broker config
>>>>>>> `group.share.errors.deadletterqueue.enable` to turn DLQ on/off.
>>>>>>> The other broker configs in this KIP do not start with
>>>>>>> `group.share.` because they're intended for other DLQ uses by the
>>>>>>> broker in future.
>>>>>>>
>>>>>>> Note that although share.version=2 is required to enable DLQ, this
>>>>>>> isn't a suitable long-term switch because we might have
>>>>>>> share.version > 2 due to another future enhancement.
>>>>>>>
>>>>>>>>> chia03: Does the broker log an error for every message if the
>>>>>>>>> DLQ topic fails to be created?
>>>>>>>
>>>>>>> No, that seems excessive and likely to flood the logs. I would
>>>>>>> implement something like no more than one log per minute, per
>>>>>>> share-partition. That would be annoying enough to fix without
>>>>>>> being catastrophically verbose.
>>>>>>>
>>>>>>> Of course, if the group config `errors.deadletterqueue.topic.name`
>>>>>>> has a value which does not satisfy the broker config
>>>>>>> `errors.deadletterqueue.topic.name.prefix`, it will be considered
>>>>>>> a config error and the DLQ will not be used.
>>>>>>>
>>>>>>>>> chia04: Have you considered adding metrics for the DLQ?
>>>>>>>
>>>>>>> Yes, that is a good idea. I've added some metrics to the KIP.
>>>>>>> Please take a look.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>> On 4 Aug 2025, at 11:30, Andrew Schofield
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Thanks for your comments on the KIP and sorry for the delay in
>>>>>>>> responding.
>>>>>>>>
>>>>>>>> D01: Authorisation is the area of this KIP that I think is most
>>>>>>>> tricky. The reason that I didn't implement specific ACLs for DLQs
>>>>>>>> is that I was not convinced they would help. So, if you have a
>>>>>>>> specific idea in mind, please let me know. This is the area that
>>>>>>>> I'm least comfortable with in the KIP.
>>>>>>>>
>>>>>>>> I suppose that to set the DLQ name for a group, you could need a
>>>>>>>> higher level of authorisation than just ALTER_CONFIGS on the
>>>>>>>> GROUP. But what I settled on in the KIP was that DLQ topics all
>>>>>>>> start with the same prefix, defaulting to "dlq.", and that the
>>>>>>>> topics are not created automatically.
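>>>>>>>>
>>>>>>>> One nice consequence of the common prefix is that the existing
>>>>>>>> ACL machinery can scope access to all DLQ topics at once. A
>>>>>>>> sketch, assuming the default "dlq." prefix and a made-up
>>>>>>>> principal:
>>>>>>>>
>>>>>>>>     import java.util.List;
>>>>>>>>     import java.util.Map;
>>>>>>>>     import org.apache.kafka.clients.admin.Admin;
>>>>>>>>     import org.apache.kafka.common.acl.AccessControlEntry;
>>>>>>>>     import org.apache.kafka.common.acl.AclBinding;
>>>>>>>>     import org.apache.kafka.common.acl.AclOperation;
>>>>>>>>     import org.apache.kafka.common.acl.AclPermissionType;
>>>>>>>>     import org.apache.kafka.common.resource.PatternType;
>>>>>>>>     import org.apache.kafka.common.resource.ResourcePattern;
>>>>>>>>     import org.apache.kafka.common.resource.ResourceType;
>>>>>>>>
>>>>>>>>     public class GrantDlqRead {
>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>             try (Admin admin = Admin.create(
>>>>>>>>                     Map.<String, Object>of("bootstrap.servers",
>>>>>>>>                         "localhost:9092"))) {
>>>>>>>>                 // Allow the application to read every topic
>>>>>>>>                 // whose name starts with the DLQ prefix.
>>>>>>>>                 AclBinding binding = new AclBinding(
>>>>>>>>                     new ResourcePattern(ResourceType.TOPIC, "dlq.",
>>>>>>>>                         PatternType.PREFIXED),
>>>>>>>>                     new AccessControlEntry("User:app", "*",
>>>>>>>>                         AclOperation.READ, AclPermissionType.ALLOW));
>>>>>>>>                 admin.createAcls(List.of(binding)).all().get();
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>>     }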
>>>>>>>>
>>>>>>>> D02: I can see that. I've added a config which I've called
>>>>>>>> errors.deadletterqueue.auto.create.topics.enable just to have a
>>>>>>>> consistent prefix on all of the config names. Let me know what
>>>>>>>> you think.
>>>>>>>>
>>>>>>>> D03: I've added some text about failure scenarios when attempting
>>>>>>>> to write records to the DLQ.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>> ________________________________________
>>>>>>>> From: isding_l <[email protected]>
>>>>>>>> Sent: 16 July 2025 04:18
>>>>>>>> To: dev <[email protected]>
>>>>>>>> Subject: Re: [DISCUSS]: KIP-1191: Dead-letter queues for share
>>>>>>>> groups
>>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>> Thanks for the nice KIP. The design for introducing dead-letter
>>>>>>>> queues (DLQs) for Share Groups is generally clear and reasonable,
>>>>>>>> addressing the key pain points of handling "poison messages".
>>>>>>>>
>>>>>>>> D01: Should we consider implementing independent ACL
>>>>>>>> configurations for DLQs? This would enable separate management of
>>>>>>>> DLQ topic read/write permissions from source topics, preventing
>>>>>>>> privilege escalation attacks via "poison message" + DLQ
>>>>>>>> mechanisms.
>>>>>>>>
>>>>>>>> D02: While disabling automatic DLQ topic creation is justifiable
>>>>>>>> for security, it creates operational overhead in automated
>>>>>>>> deployments. Can we introduce a configuration parameter
>>>>>>>> auto.create.dlq.topics.enable to govern this behavior?
>>>>>>>>
>>>>>>>> D03: How should we handle failure scenarios when brokers attempt
>>>>>>>> to write records to the DLQ?
>>>>>>>>
>>>>>>>> ---- Replied Message ----
>>>>>>>> | From    | Andrew Schofield <[email protected]> |
>>>>>>>> | Date    | 07/08/2025 17:54 |
>>>>>>>> | To      | [email protected] <[email protected]> |
>>>>>>>> | Subject | [DISCUSS]: KIP-1191: Dead-letter queues for share groups |
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I'd like to start discussion on KIP-1191, which adds dead-letter
>>>>>>>> queue support for share groups. Records which cannot be processed
>>>>>>>> by consumers in a share group can be automatically copied onto
>>>>>>>> another topic for a closer look.
>>>>>>>>
>>>>>>>> KIP:
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1191%3A+Dead-letter+queues+for+share+groups
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
