sanghyeok An created KAFKA-20058:
------------------------------------

             Summary: Fix race condition on backoffDeadlineMs on 
RPCProducerIdManager causing premature retries
                 Key: KAFKA-20058
                 URL: https://issues.apache.org/jira/browse/KAFKA-20058
             Project: Kafka
          Issue Type: Bug
            Reporter: sanghyeok An
            Assignee: sanghyeok An


While investigating a flaky failure in 
ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in 
RPCProducerIdManager.maybeRequestNextBlock().

maybeRequestNextBlock() currently does:
 * sendRequest()
 * backoffDeadlineMs.set(NO_RETRY) (unconditional)


On the response path, handleUnsuccessfulResponse() does:
 * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
 * requestInFlight.set(false)

 

Because sendRequest() is asynchronous, handleUnsuccessfulResponse() can run 
before maybeRequestNextBlock() reaches its unconditional reset. The reset then 
overwrites the newly set backoff deadline with NO_RETRY, so a subsequent 
generateProducerId() call can re-send immediately. That may prefill 
nextProducerIdBlock earlier than expected, which makes the test flaky (and 
potentially causes unnecessary controller traffic).
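
The problematic ordering can be replayed deterministically on a single thread. 
This is only a minimal sketch, not the actual RPCProducerIdManager code: the 
class name, the method name, and the values chosen for NO_RETRY and 
RETRY_BACKOFF_MS are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BackoffRaceSketch {
    // Assumed values for illustration; the real constants live in RPCProducerIdManager.
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    // Replays the bad interleaving in program order: the response handler
    // finishes before the request path performs its unconditional reset.
    static long replayUnconditionalReset(long now) {
        AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY);

        // Response path (handleUnsuccessfulResponse) runs first and arms the backoff.
        backoffDeadlineMs.set(now + RETRY_BACKOFF_MS);

        // Request path resumes after sendRequest() returns and resets unconditionally,
        // discarding the deadline that was just set.
        backoffDeadlineMs.set(NO_RETRY);

        return backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        // The backoff deadline is lost: the value printed is NO_RETRY.
        System.out.println(replayUnconditionalReset(1_000L));
    }
}
```

With the deadline back at NO_RETRY, nothing stops the next caller from sending 
another request right away.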

 

In production, this race is less likely because the request/response path 
typically has higher latency than in the unit test (which simulates the 
controller response on a local executor). However, the code still has a 
correctness window where a newly set backoff deadline can be clobbered by an 
unconditional reset. Using compareAndSet to conditionally reset backoff 
preserves the intended behavior, avoids overwriting newer backoff values, and 
should have negligible performance impact (CAS is only executed on the request 
path, and contention should be rare). This also eliminates the observed test 
flakiness.
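
One way the conditional reset could look is sketched below. It assumes the 
request path reads the deadline before sending and passes that value as the 
expected argument to compareAndSet; the issue does not spell out the exact 
expected value, so treat this as one possible shape of the fix rather than the 
actual patch. The class name, method name, and constant values are 
hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BackoffCasSketch {
    // Assumed values for illustration; the real constants live in RPCProducerIdManager.
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    // Simulates one pass through the request path, optionally letting the
    // response handler run between the send and the reset.
    static long replayConditionalReset(long initialDeadline, long now, boolean responseArrivesFirst) {
        AtomicLong backoffDeadlineMs = new AtomicLong(initialDeadline);

        // Request path: remember the deadline seen before sending (assumed approach).
        long observed = backoffDeadlineMs.get();

        if (responseArrivesFirst) {
            // Response path (handleUnsuccessfulResponse) arms a new backoff deadline.
            backoffDeadlineMs.set(now + RETRY_BACKOFF_MS);
        }

        // Request path: reset only if nobody changed the deadline in the meantime.
        backoffDeadlineMs.compareAndSet(observed, NO_RETRY);

        return backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        // Response handler wins the race: the new deadline (now + backoff) survives.
        System.out.println(replayConditionalReset(900L, 1_000L, true));
        // No interference: the expired deadline is reset to NO_RETRY as before.
        System.out.println(replayConditionalReset(900L, 1_000L, false));
    }
}
```

In this sketch the compareAndSet fails harmlessly when the response handler has 
already written a newer deadline, so the backoff stays in effect.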



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
