[
https://issues.apache.org/jira/browse/KAFKA-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sanghyeok An updated KAFKA-20058:
---------------------------------
Labels: transaction (was: )
> Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing
> premature retries
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-20058
> URL: https://issues.apache.org/jira/browse/KAFKA-20058
> Project: Kafka
> Issue Type: Bug
> Reporter: sanghyeok An
> Assignee: sanghyeok An
> Priority: Minor
> Labels: transaction
>
> While investigating a flaky failure in
> ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in
> RPCProducerIdManager.maybeRequestNextBlock().
> maybeRequestNextBlock() currently does:
> * sendRequest()
> * backoffDeadlineMs.set(NO_RETRY) (unconditional)
> On the response path, handleUnsuccessfulResponse() does:
> * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
> * requestInFlight.set(false)
>
> Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs
> reset can run after handleUnsuccessfulResponse(), overwriting the newly-set
> backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent
> generateProducerId() call can re-send immediately, which may prefill
> nextProducerIdBlock earlier than expected and lead to test flakiness (and
> potentially unnecessary controller traffic).
>
> In production, this race is less likely because the request/response path
> typically has higher latency than in the unit test (which simulates the
> controller response on a local executor). However, the code still has a
> correctness window where a newly set backoff deadline can be clobbered by an
> unconditional reset. Using compareAndSet to conditionally reset backoff
> preserves the intended behavior, avoids overwriting newer backoff values, and
> should have negligible performance impact (CAS is only executed on the
> request path, and contention should be rare). This also eliminates the
> observed test flakiness.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)