junrao commented on PR #14111: URL: https://github.com/apache/kafka/pull/14111#issuecomment-1659045347
@AndrewJSchofield : Looking at the code a bit more closely, I am not sure that the exponential backoff logic is added properly for the common failure cases in this PR.

On the producer side, this is the loop when there is a leader change.

```
Sender
  call RecordAccumulator.partitionReady() to check if the batch needs to back off
  call completeBatch()
    --> on retriable error, reenqueueBatch()
    --> on metadata error, trigger updateMetadata()
```

If the metadata propagation is delayed, the Sender will stay in the above loop for multiple iterations. This PR only uses retryBackoffMax when the metadata update fails. However, in the common case the metadata request won't fail; the issue is that the metadata is stale. So, it seems that we need to change `RecordAccumulator.partitionReady()` to check `retryBackoffMax` and implement the exponential backoff logic when the batch is re-enqueued.

On the consumer side, there is a similar loop when there is a leader epoch bump.

```
KafkaConsumer.poll()
  call sendFetches()
  call pollForFetches()
    --> call AbstractFetch.handleInitializeCompletedFetchErrors
      --> requestMetadataUpdate()
```

Again, if the metadata propagation is delayed, the consumer poll() call will stay in the above loop for multiple iterations. This PR only uses retryBackoffMax when the metadata update fails. However, in the common case the metadata request won't fail; the issue is that the metadata is stale. So, it seems that we need to add the exponential backoff logic to the above loop as well.
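For concreteness, the re-enqueue path in both loops could compute its delay with a standard exponential-backoff-with-jitter formula: grow the wait per attempt up to `retry.backoff.max.ms`, with random jitter to avoid synchronized retries. Below is a minimal self-contained sketch; the class name, constants, and method are illustrative assumptions, not Kafka's actual API.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the per-attempt backoff computation that
// partitionReady() (or the consumer loop) would need. Constant names
// mirror the configs conceptually but are not real Kafka identifiers.
public class BackoffSketch {
    static final long INITIAL_BACKOFF_MS = 100;    // cf. retry.backoff.ms
    static final long RETRY_BACKOFF_MAX_MS = 1000; // cf. retry.backoff.max.ms
    static final double MULTIPLIER = 2.0;
    static final double JITTER = 0.2;              // +/-20% randomization

    // Delay for the given retry attempt: initial * multiplier^attempts,
    // capped at the max, then jittered so clients don't retry in lockstep.
    static long backoffMs(int attempts) {
        double exp = INITIAL_BACKOFF_MS * Math.pow(MULTIPLIER, attempts);
        double capped = Math.min(exp, RETRY_BACKOFF_MAX_MS);
        double jitter = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * JITTER;
        return (long) (capped * jitter);
    }

    public static void main(String[] args) {
        for (int attempts = 0; attempts < 6; attempts++) {
            System.out.println("attempt " + attempts + " -> " + backoffMs(attempts) + " ms");
        }
    }
}
```

The key point is that the cap applies whether or not the metadata request itself failed, so a stale-metadata loop still backs off instead of spinning at a fixed short interval.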