Lianet Magrans created KAFKA-20077:
--------------------------------------

             Summary: Producer using transactions v1 could hang on flush upon 
retriable failures adding partitions to tx 
                 Key: KAFKA-20077
                 URL: https://issues.apache.org/jira/browse/KAFKA-20077
             Project: Kafka
          Issue Type: Bug
          Components: clients, producer 
            Reporter: Lianet Magrans


We've seen some occurrences of producer.flush hanging indefinitely when a topic 
is deleted and the producer is using transactions v1 (not using 2pc).

In the case where the producer has records in the buffer, and the topic 
deletion happens right before adding the first partition to the transaction, we 
could fall into a loop where AddPartitionsToTx fails with a retriable error and 
is continuously retried. In this case, none of the timeouts related to send, 
transactions or requests seem to apply:
 * transaction.timeout.ms -> not applied because no partition has been added 
yet
 * delivery.timeout.ms -> not applied because the client does not attempt 
sending (where batch expiration applies) while it has a transactional request 
in progress (i.e. AddPartitionsToTx); see the early return here: 
[https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
 * request.timeout.ms -> not applied because it's not really a request 
failure; it is the high-level operation to add partitions that is retried.
 * default.api.timeout.ms -> does not apply to the producer.flush API by 
design (or to any produce request really)

Client handling of retriable errors when adding partitions to the tx (this 
would be the case of UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):

[https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]

This only affects producers using tx v1. It is solved with tx v2, where 
partitions are not added to the tx separately, so the delivery timeout is 
checked on send and applied, unblocking the flush operation.
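
To make the interaction concrete, here is a minimal toy model of the loop 
described above. It is a sketch only, not Kafka's actual Sender or 
TransactionManager code: the class FlushHangModel, the method 
ticksUntilFlushReturns, and the notion of "ticks" are all hypothetical names 
introduced for illustration. It assumes that each sender iteration handles at 
most one pending transactional request and, if there was one, returns before 
batch expiration is evaluated.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the sender loop; hypothetical, not Kafka's real classes. */
final class FlushHangModel {

    /**
     * Simulates sender iterations ("ticks") with one buffered batch whose
     * topic has been deleted.
     *
     * @param separateAddPartitionsStep true models tx v1 (AddPartitionsToTx
     *                                  sent as its own request), false models
     *                                  tx v2 (no separate step)
     * @return the tick at which flush would return, or -1 if it is still
     *         blocked after maxTicks iterations
     */
    static int ticksUntilFlushReturns(boolean separateAddPartitionsStep,
                                      int maxTicks,
                                      int deliveryTimeoutTicks) {
        Deque<String> pendingTxnRequests = new ArrayDeque<>();
        if (separateAddPartitionsStep) {
            pendingTxnRequests.add("AddPartitionsToTx");
        }
        boolean batchBuffered = true;

        for (int tick = 1; tick <= maxTicks; tick++) {
            String request = pendingTxnRequests.poll();
            if (request != null) {
                // Assumption: the topic is gone, so every attempt fails with
                // a retriable error and is re-enqueued with no deadline.
                pendingTxnRequests.add(request);
                // Early return: batch expiration is never evaluated.
                continue;
            }
            // Only reached when no transactional request is pending.
            if (batchBuffered && tick >= deliveryTimeoutTicks) {
                batchBuffered = false; // batch expires, its future fails
            }
            if (!batchBuffered) {
                return tick; // nothing left in the buffer, flush unblocks
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("v1 model: " + ticksUntilFlushReturns(true, 1000, 5));
        System.out.println("v2 model: " + ticksUntilFlushReturns(false, 1000, 5));
    }
}
```

In this model the v1 case never reaches the expiration check, no matter how 
many iterations run, while the v2 case expires the batch as soon as the 
(simulated) delivery timeout elapses.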



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
