[jira] [Commented] (KAFKA-20077) Producer using transactions v1 could hang on flush upon retriable failures adding partitions to tx

sanghyeok An (Jira) Fri, 16 Jan 2026 15:28:08 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052487#comment-18052487
 ]


sanghyeok An commented on KAFKA-20077:
--------------------------------------

[~lianetm] Hi!

Are you planning to handle this yourself, or ask someone specific to look into 
it? If not, would it be okay if I take a look?

> Producer using transactions v1 could hang on flush upon retriable failures 
> adding partitions to tx 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-20077
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20077
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer 
>            Reporter: Lianet Magrans
>            Priority: Major
>
> We've seen some occurrences of producer.flush hanging indefinitely in 
> situations where a topic is deleted and the producer is using transactions v1 
> (not using 2pc).
> In the case where the producer has records in the buffer, and the topic 
> deletion happens right before adding the first partition to the transaction, 
> we could fall into a loop where the AddPartitionsToTx fails with a retriable 
> error and is continuously retried. In this case, none of the timeouts 
> related to send, transactions or request seem to apply:
>  * transaction.timeout.ms -> not applied because no partition has been added 
> yet
>  * delivery.timeout.ms -> not applied because the client does not attempt 
> sending (where batching expiration applies) while it's in a transactional 
> request (i.e. AddPartitionsToTx); early return here: 
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
>  * request.timeout.ms -> not applied because it's not really a request 
> failure; the high-level operation to add partitions is retried.
>  * default.api.timeout.ms -> does not apply to the producer.flush API by 
> design (or to any produce request really)
> Client handling of retriable errors when adding partitions to tx (this would 
> be the case of UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]
> This only affects producers using tx v1, and it's solved with tx v2 
> (partitions are not added to the tx separately, so the delivery timeout is 
> checked on send and applied, unblocking the flush operation).
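
As an illustration of the hang described above, a caller can at least bound its own wait on flush() by running it on a helper thread. The sketch below is not from the ticket and is not a fix for the retry loop; BoundedFlush and runWithTimeout are hypothetical names, and it assumes the blocked flush() call reacts to thread interruption (KafkaProducer.flush is documented to throw InterruptException in that case). The flush call is passed in as a plain Runnable so the snippet has no Kafka dependency.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Kafka client.
public class BoundedFlush {

    /**
     * Runs flushAction on a daemon helper thread and waits at most timeoutMs.
     * Returns true if the action completed in time, false if it timed out
     * (in which case the helper thread is interrupted).
     */
    public static boolean runWithTimeout(Runnable flushAction, long timeoutMs)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-flush");
            t.setDaemon(true); // do not keep the JVM alive if the action never returns
            return t;
        });
        Future<?> future = executor.submit(flushAction);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the helper thread
            return false;
        } catch (ExecutionException e) {
            // The action itself failed; surface its cause to the caller.
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass the producer's flush as a method reference, e.g. BoundedFlush.runWithTimeout(producer::flush, 30_000), and decide what to do when it returns false (for example, closing the producer with a timeout). The 30 second value is an arbitrary placeholder.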



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
