Lianet Magrans created KAFKA-20077:
--------------------------------------
Summary: Producer using transactions v1 could hang on flush upon
retriable failures adding partitions to tx
Key: KAFKA-20077
URL: https://issues.apache.org/jira/browse/KAFKA-20077
Project: Kafka
Issue Type: Bug
Components: clients, producer
Reporter: Lianet Magrans
We've seen some occurrences of producer.flush hanging indefinitely in
situations where a topic is deleted while the producer is using transactions v1
(not using 2PC).
In the case where the producer has records in the buffer, and the topic
deletion happens right before adding the first partition to the transaction, we
could fall into a loop where AddPartitionsToTx fails with a retriable error
and is retried continuously. In this case, none of the timeouts related to
send, transactions or requests seem to apply:
* transaction.timeout.ms -> not applied because no partition has been added yet
* delivery.timeout.ms -> not applied because the client does not attempt
sending (where batch expiration applies) while it's processing a transactional
request (i.e. AddPartitionsToTx); see the early return here:
[https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
* request.timeout.ms -> not applied because it's not really a request failure;
the high-level operation to add partitions is what gets retried.
* default.api.timeout.ms -> does not apply to the producer.flush API by design
(or to any produce request, really)
Client handling of retriable errors when adding partitions to tx (this would be
the case of UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):
[https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]
This only affects producers using tx v1, and it's solved with tx v2 (partitions
are not added to the tx separately, so the delivery timeout is checked on send
and applied, unblocking the flush operation).
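
To make the mechanism easier to follow, here is a toy model of the loop
described above. It is not Kafka code: the class, method and parameter names
are all hypothetical, the clock is simulated, and the "request" is just a
string in a queue. It only assumes what this ticket states, namely that a
pending AddPartitionsToTx causes an early return before the batch expiration
check (tx v1), and that without a separate add-partitions step the check is
reached on every pass (tx v2).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the sender loop described in this ticket. Not Kafka code. */
public class TxV1FlushHangSketch {

    /** Outcome of a simulated run. */
    enum Outcome { BATCH_EXPIRED, STILL_BLOCKED }

    /**
     * Simulates up to maxIterations passes of a sender loop, advancing a fake
     * clock by tickMs on each pass.
     *
     * separateAddPartitions = true models tx v1: a pending AddPartitionsToTx
     * is handled first and the pass returns early, so the expiration check is
     * never reached. The request always fails with a retriable error (the
     * topic is gone) and is re-enqueued.
     *
     * separateAddPartitions = false models tx v2: there is no separate
     * request, so every pass reaches the expiration check.
     */
    static Outcome run(boolean separateAddPartitions, long deliveryTimeoutMs,
                       long tickMs, int maxIterations) {
        Deque<String> pendingTxRequests = new ArrayDeque<>();
        if (separateAddPartitions) {
            pendingTxRequests.add("AddPartitionsToTx");
        }
        long nowMs = 0;
        long batchCreatedMs = 0;
        for (int i = 0; i < maxIterations; i++) {
            nowMs += tickMs;
            String request = pendingTxRequests.poll();
            if (request != null) {
                // Retriable failure: re-enqueue the request and return early,
                // skipping the expiration check below.
                pendingTxRequests.add(request);
                continue;
            }
            // Only reached when no transactional request is pending.
            if (nowMs - batchCreatedMs >= deliveryTimeoutMs) {
                return Outcome.BATCH_EXPIRED; // flush would be unblocked here
            }
        }
        return Outcome.STILL_BLOCKED;
    }

    public static void main(String[] args) {
        // 100,000 passes of 100 ms each, far beyond a 120 s delivery timeout.
        System.out.println("tx v1: " + run(true, 120_000L, 100L, 100_000));
        System.out.println("tx v2: " + run(false, 120_000L, 100L, 100_000));
    }
}
```

In this model the v1 run never reaches the expiration check no matter how much
simulated time passes, while the v2 run expires the batch once the delivery
timeout has elapsed.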