[
https://issues.apache.org/jira/browse/KAFKA-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sanghyeok An updated KAFKA-20058:
---------------------------------
Description:
While investigating a flaky failure in
ProducerIdManagerTest.testRetryBackoffOnNoResponse,
I found a race in RPCProducerIdManager.maybeRequestNextBlock().
maybeRequestNextBlock() currently does:
* sendRequest()
* backoffDeadlineMs.set(NO_RETRY) (unconditional)
On the response path, handleUnsuccessfulResponse() does:
* backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
* requestInFlight.set(false)
Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs
reset can run after handleUnsuccessfulResponse(), overwriting the newly-set
backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent
generateProducerId() call can re-send immediately, which may prefill
nextProducerIdBlock earlier than expected and lead to test flakiness (and
potentially unnecessary controller traffic).
In production, this race is less likely because the request/response path
typically has higher latency than in the unit test (which simulates the
controller response on a local executor). However, the code still has a
correctness window where a newly set backoff deadline can be clobbered by an
unconditional reset. Using compareAndSet to conditionally reset backoff
preserves the intended behavior, avoids overwriting newer backoff values, and
should have negligible performance impact (CAS is only executed on the request
path, and contention should be rare). This also eliminates the observed test
flakiness.
was:
While investigating a flaky failure in
ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in
RPCProducerIdManager.maybeRequestNextBlock().
maybeRequestNextBlock() currently does:
* sendRequest()
* backoffDeadlineMs.set(NO_RETRY) (unconditional)
On the response path, handleUnsuccessfulResponse() does:
* backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
* requestInFlight.set(false)
Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs
reset can run after handleUnsuccessfulResponse(), overwriting the newly-set
backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent
generateProducerId() call can re-send immediately, which may prefill
nextProducerIdBlock earlier than expected and lead to test flakiness (and
potentially unnecessary controller traffic).
In production, this race is less likely because the request/response path
typically has higher latency than in the unit test (which simulates the
controller response on a local executor). However, the code still has a
correctness window where a newly set backoff deadline can be clobbered by an
unconditional reset. Using compareAndSet to conditionally reset backoff
preserves the intended behavior, avoids overwriting newer backoff values, and
should have negligible performance impact (CAS is only executed on the request
path, and contention should be rare). This also eliminates the observed test
flakiness.
> Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing
> premature retries
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-20058
> URL: https://issues.apache.org/jira/browse/KAFKA-20058
> Project: Kafka
> Issue Type: Bug
> Reporter: sanghyeok An
> Assignee: sanghyeok An
> Priority: Minor
> Labels: transaction
>
> While investigating a flaky failure in
> ProducerIdManagerTest.testRetryBackoffOnNoResponse,
> I found a race in RPCProducerIdManager.maybeRequestNextBlock().
> maybeRequestNextBlock() currently does:
> * sendRequest()
> * backoffDeadlineMs.set(NO_RETRY) (unconditional)
> On the response path, handleUnsuccessfulResponse() does:
> * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
> * requestInFlight.set(false)
>
> Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs
> reset can run after handleUnsuccessfulResponse(), overwriting the newly-set
> backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent
> generateProducerId() call can re-send immediately, which may prefill
> nextProducerIdBlock earlier than expected and lead to test flakiness (and
> potentially unnecessary controller traffic).
>
> In production, this race is less likely because the request/response path
> typically has higher latency than in the unit test (which simulates the
> controller response on a local executor). However, the code still has a
> correctness window where a newly set backoff deadline can be clobbered by an
> unconditional reset. Using compareAndSet to conditionally reset backoff
> preserves the intended behavior, avoids overwriting newer backoff values, and
> should have negligible performance impact (CAS is only executed on the
> request path, and contention should be rare). This also eliminates the
> observed test flakiness.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)