[ https://issues.apache.org/jira/browse/KAFKA-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762921#comment-17762921 ]
Satish Duggana commented on KAFKA-9800: --------------------------------------- [~junrao] [~schofielaj] Are you planning to merge these changes to 3.6 branch? > [KIP-580] Client Exponential Backoff Implementation > --------------------------------------------------- > > Key: KAFKA-9800 > URL: https://issues.apache.org/jira/browse/KAFKA-9800 > Project: Kafka > Issue Type: New Feature > Reporter: Cheng Tan > Assignee: Andrew Schofield > Priority: Major > Labels: KIP-580, client > Fix For: 3.7.0 > > > Design: > The main idea is to bookkeep the failed attempt. Currently, the retry backoff > has two main usage patterns: > # Synchronous retires and blocking loop. The thread will sleep in each > iteration for retry backoff ms. > # Async retries. In each polling, the retries do not meet the backoff will > be filtered. The data class often maintains a 1:1 mapping to a set of > requests which are logically associated. (i.e. a set contains only one > initial request and only its retries.) > For type 1, we can utilize a local failure counter of a Java generic data > type. > For case 2, I already wrapped the exponential backoff/timeout util class in > my KIP-601 > [implementation|https://github.com/apache/kafka/pull/8683/files#diff-9ca2b1294653dfa914b9277de62b52e3R28] > which takes the number of attempts and returns the backoff/timeout value at > the corresponding level. Thus, we can add a new class property to those > classes containing retriable data in order to record the number of failed > attempts. > > Changes: > KafkaProducer: > # Produce request (ApiKeys.PRODUCE). Currently, the backoff applies to each > ProducerBatch in Accumulator, which already has an attribute attempts > recording the number of failed attempts. So we can let the Accumulator > calculate the new retry backoff for each bach when it enqueues them, to avoid > instantiate the util class multiple times. > # Transaction request (ApiKeys..*TXN). TxnRequestHandler will have a new > class property of type `Long` to record the number of attempts. > KafkaConsumer: > # Some synchronous retry use cases. Record the failed attempts in the > blocking loop. > # Partition request (ApiKeys.OFFSET_FOR_LEADER_EPOCH, ApiKeys.LIST_OFFSETS). > Though the actual requests are packed for each node, the current > implementation is applying backoff to each topic partition, where the backoff > value is kept by TopicPartitionState. Thus, TopicPartitionState will have the > new property recording the number of attempts. > Metadata: > # Metadata lives as a singleton in many clients. Add a new property > recording the number of attempts > AdminClient: > # AdminClient has its own request abstraction Call. The failed attempts are > already kept by the abstraction. So probably clean the Call class logic a bit. > Existing tests: > # If the tests are testing the retry backoff, add a delta to the assertion, > considering the existence of the jitter. > # If the tests are testing other functionality, we can specify the same > value for both `retry.backoff.ms` and `retry.backoff.max.ms` in order to make > the retry backoff static. We can use this trick to make the existing tests > compatible with the changes. > There're other common usages look like client.poll(timeout), where the > timeout passed in is the retry backoff value. We won't change these usages > since its underlying logic is nioSelector.select(timeout) and > nioSelector.selectNow(), which means if no interested op exists, the client > will block retry backoff milliseconds. This is an optimization when there's > no request that needs to be sent but the client is waiting for responses. > Specifically, if the client fails the inflight requests before the retry > backoff milliseconds passed, it still needs to wait until that amount of time > passed, unless there's a new request need to be sent. > -- This message was sent by Atlassian Jira (v8.20.10#820010)