sanghyeok An created KAFKA-20058:
------------------------------------
Summary: Fix race condition on backoffDeadlineMs on
RPCProducerIdManager causing premature retries
Key: KAFKA-20058
URL: https://issues.apache.org/jira/browse/KAFKA-20058
Project: Kafka
Issue Type: Bug
Reporter: sanghyeok An
Assignee: sanghyeok An
While investigating a flaky failure in
ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in
RPCProducerIdManager.maybeRequestNextBlock().
maybeRequestNextBlock() currently does:
* sendRequest()
* backoffDeadlineMs.set(NO_RETRY) (unconditional)
On the response path, handleUnsuccessfulResponse() does:
* backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
* requestInFlight.set(false)
Because sendRequest() is asynchronous, the response can be handled before the
calling thread reaches the unconditional backoffDeadlineMs reset. The reset
then runs after handleUnsuccessfulResponse() and overwrites the newly set
backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent
generateProducerId() call can re-send immediately instead of waiting out the
backoff, which may prefill nextProducerIdBlock earlier than expected and lead
to test flakiness (and potentially unnecessary controller traffic).
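The interleaving described above can be replayed on a single thread to make the
ordering explicit. This is only a minimal sketch, not the RPCProducerIdManager
code: the class name, the replayRace helper, and the values chosen for NO_RETRY
and RETRY_BACKOFF_MS are assumptions made for illustration. Only the two fields
and the order of the set() calls follow the report.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class BackoffClobberSketch {
    // Assumed values for illustration; the real constants live in RPCProducerIdManager.
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    final AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY);
    final AtomicBoolean requestInFlight = new AtomicBoolean(false);

    // Response path, in the order given in the report.
    void handleUnsuccessfulResponse(long nowMs) {
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);
        requestInFlight.set(false);
    }

    // Hypothetical helper: replays the problematic ordering deterministically.
    static long replayRace(long nowMs) {
        BackoffClobberSketch manager = new BackoffClobberSketch();
        manager.requestInFlight.set(true);          // request has been handed off asynchronously
        manager.handleUnsuccessfulResponse(nowMs);  // response thread finishes first
        manager.backoffDeadlineMs.set(NO_RETRY);    // request thread's unconditional reset runs last
        return manager.backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        long deadline = replayRace(1_000L);
        // The backoff recorded by the response path has been lost.
        System.out.println("deadline clobbered: " + (deadline == NO_RETRY));
    }
}
```

After this ordering, backoffDeadlineMs is NO_RETRY and requestInFlight is false,
so nothing stops the next generateProducerId() call from sending again right
away.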
In production, this race is less likely because the request/response path
typically has higher latency than in the unit test (which simulates the
controller response on a local executor). However, the code still has a
correctness window in which a newly set backoff deadline can be clobbered by
the unconditional reset. Resetting the backoff with compareAndSet, so that the
reset only applies if the deadline has not changed in the meantime, preserves
the intended behavior and avoids overwriting newer backoff values. It should
have negligible performance impact: the CAS is only executed on the request
path, and contention should be rare. This also eliminates the observed test
flakiness.
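A conditional reset could look roughly like the sketch below. It assumes the
request path snapshots the deadline before sending and uses that snapshot as
the expected value for compareAndSet; the actual patch may pick a different
expected value. The class name, helper methods, and constant values are
hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BackoffCasSketch {
    // Assumed values for illustration; the real constants live in RPCProducerIdManager.
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    // Hypothetical helper: the response path records a failure before the
    // request path gets to reset the deadline.
    static long replayWithFailure(long initialDeadlineMs, long nowMs) {
        AtomicLong backoffDeadlineMs = new AtomicLong(initialDeadlineMs);
        long observed = backoffDeadlineMs.get();              // snapshot taken before sending
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);      // response path sets a new backoff
        backoffDeadlineMs.compareAndSet(observed, NO_RETRY);  // no-op: the value has changed
        return backoffDeadlineMs.get();
    }

    // Hypothetical helper: no response has arrived yet, so the reset applies.
    static long replayWithoutFailure(long initialDeadlineMs) {
        AtomicLong backoffDeadlineMs = new AtomicLong(initialDeadlineMs);
        long observed = backoffDeadlineMs.get();
        backoffDeadlineMs.compareAndSet(observed, NO_RETRY);  // succeeds: value is unchanged
        return backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        System.out.println("backoff kept:  " + (replayWithFailure(900L, 1_000L) == 1_050L));
        System.out.println("backoff reset: " + (replayWithoutFailure(900L) == NO_RETRY));
    }
}
```

With this shape, a late reset from the request path cannot erase a deadline
that the response path has already written, while the normal case (no
intervening failure) still clears the old deadline.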