[jira] [Updated] (KAFKA-20058) Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing premature retries

sanghyeok An (Jira) Sat, 10 Jan 2026 04:32:09 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sanghyeok An updated KAFKA-20058:
---------------------------------
    Description: 
While investigating a flaky failure in 
ProducerIdManagerTest.testRetryBackoffOnNoResponse, 
I found a race in RPCProducerIdManager.maybeRequestNextBlock().

maybeRequestNextBlock() currently does:
 * sendRequest()
 * backoffDeadlineMs.set(NO_RETRY) (unconditional)

On the response path, handleUnsuccessfulResponse() does:
 * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
 * requestInFlight.set(false)

 

Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs 
reset can run after handleUnsuccessfulResponse(), overwriting the newly-set 
backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent 
generateProducerId() call can re-send immediately, which may prefill 
nextProducerIdBlock earlier than expected and lead to test flakiness (and 
potentially unnecessary controller traffic).

 

In production, this race is less likely because the request/response path 
typically has higher latency than in the unit test (which simulates the 
controller response on a local executor). However, the code still has a 
correctness window where a newly set backoff deadline can be clobbered by an 
unconditional reset. Using compareAndSet to conditionally reset backoff 
preserves the intended behavior, avoids overwriting newer backoff values, and 
should have negligible performance impact (CAS is only executed on the request 
path, and contention should be rare). This also eliminates the observed test 
flakiness.

  was:
While investigating a flaky failure in 
ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in 
RPCProducerIdManager.maybeRequestNextBlock().

maybeRequestNextBlock() currently does:
 * sendRequest()
 * backoffDeadlineMs.set(NO_RETRY) (unconditional)

On the response path, handleUnsuccessfulResponse() does:
 * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
 * requestInFlight.set(false)

 

Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs 
reset can run after handleUnsuccessfulResponse(), overwriting the newly-set 
backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent 
generateProducerId() call can re-send immediately, which may prefill 
nextProducerIdBlock earlier than expected and lead to test flakiness (and 
potentially unnecessary controller traffic).

 

In production, this race is less likely because the request/response path 
typically has higher latency than in the unit test (which simulates the 
controller response on a local executor). However, the code still has a 
correctness window where a newly set backoff deadline can be clobbered by an 
unconditional reset. Using compareAndSet to conditionally reset backoff 
preserves the intended behavior, avoids overwriting newer backoff values, and 
should have negligible performance impact (CAS is only executed on the request 
path, and contention should be rare). This also eliminates the observed test 
flakiness.


> Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing 
> premature retries
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-20058
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20058
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: sanghyeok An
>            Assignee: sanghyeok An
>            Priority: Minor
>              Labels: transaction
>
> While investigating a flaky failure in 
> ProducerIdManagerTest.testRetryBackoffOnNoResponse, 
> I found a race in RPCProducerIdManager.maybeRequestNextBlock().
> maybeRequestNextBlock() currently does:
>  * sendRequest()
>  * backoffDeadlineMs.set(NO_RETRY) (unconditional)
> On the response path, handleUnsuccessfulResponse() does:
>  * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
>  * requestInFlight.set(false)
>  
> Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs 
> reset can run after handleUnsuccessfulResponse(), overwriting the newly-set 
> backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent 
> generateProducerId() call can re-send immediately, which may prefill 
> nextProducerIdBlock earlier than expected and lead to test flakiness (and 
> potentially unnecessary controller traffic).
>  
> In production, this race is less likely because the request/response path 
> typically has higher latency than in the unit test (which simulates the 
> controller response on a local executor). However, the code still has a 
> correctness window where a newly set backoff deadline can be clobbered by an 
> unconditional reset. Using compareAndSet to conditionally reset backoff 
> preserves the intended behavior, avoids overwriting newer backoff values, and 
> should have negligible performance impact (CAS is only executed on the 
> request path, and contention should be rare). This also eliminates the 
> observed test flakiness.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-20058) Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing premature retries

Reply via email to