adixitconfluent commented on PR #17965:
URL: https://github.com/apache/kafka/pull/17965#issuecomment-2509650770
> @adixitconfluent Thanks for the PR, but for my understanding when can this
> scenario happen? We take a lock in the `acquire` which means a single
> thread/client can have access. And acquire method is synchronous update hence
> state will be fully transitioned. How do we encounter this issue?

@apoorvmittal10, the issue can occur when the writeState RPC has not yet
finished writing updates to the persister for AVAILABLE records, which can
result in a leaked or null `acquisitionLockTimeoutTask`. Quoting the issue
description from the ticket for details:

`SharePartition#acquire` does not honor the rollback state [1][2]. This
causes two issues.
a. leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a
new `acquisitionLockTimeoutTask` for the "available" batch; however, the
available batch in transition already has an `acquisitionLockTimeoutTask`, so
the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
b. null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be
reproduced in the following order:
1. The batch is in transition - the current state is `AVAILABLE` and the
   rollback state is `ACQUIRED`.
2. `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC,
   so it has not yet called `InFlightState#completeStateTransition`.
3. `SharePartition#acquire` assumes the batch is available, so it changes the
   state from `AVAILABLE` to `ACQUIRED` and creates a new
   `acquisitionLockTimeoutTask` (see a.).
4. `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it
   commits the state and cancels the `acquisitionLockTimeoutTask` - which means
   the batch is in `ACQUIRED` but has no `acquisitionLockTimeoutTask`.
5. The next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but
   it hits an NPE on `acquisitionLockTimeoutTask` [3], and the request then
   hangs until it times out.
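
The sequence in (a) and (b) can be replayed in a small single-threaded sketch. This is only an illustration under stated assumptions: `InFlightStateSketch`, `TimeoutTask`, `tryAcquire`, `startTransition` and the `honorTransition` flag are hypothetical stand-ins, not the real `SharePartition`/`InFlightState` code, and the guard shown is just one possible way to honor the rollback state.

```java
// Hypothetical, simplified model of the race; these are NOT the real Kafka classes.
final class AcquireRaceSketch {
    // Replays steps 1, 3 and 4 from the ticket in order.
    static InFlightStateSketch simulate(boolean honorTransition) {
        InFlightStateSketch s = new InFlightStateSketch(); // starts ACQUIRED with a timer
        s.startTransition(RecordState.AVAILABLE);          // step 1: write to persister pending
        s.tryAcquire(honorTransition);                     // step 3: acquire during the transition
        s.completeStateTransition();                       // step 4: RPC completes, timer cancelled
        return s;
    }

    public static void main(String[] args) {
        for (boolean guard : new boolean[] {false, true}) {
            InFlightStateSketch s = simulate(guard);
            System.out.println("guard=" + guard + " state=" + s.state
                + " hasTimer=" + (s.acquisitionLockTimeoutTask != null)
                + " leakedTimer=" + (s.leakedTask != null));
        }
    }
}

enum RecordState { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

// Stand-in for the acquisition lock timer (assumption: only cancellation matters here).
final class TimeoutTask {
    boolean cancelled;
    void cancel() { cancelled = true; }
}

final class InFlightStateSketch {
    RecordState state = RecordState.ACQUIRED;
    RecordState rollbackState;  // non-null while a persister write is in flight
    TimeoutTask acquisitionLockTimeoutTask = new TimeoutTask();
    TimeoutTask leakedTask;     // a timer that was overwritten without being cancelled

    // Optimistically move to the new state and remember the old one for rollback.
    void startTransition(RecordState newState) {
        rollbackState = state;
        state = newState;
    }

    // Commit path: clear the rollback state, then cancel and clear the current timer.
    void completeStateTransition() {
        rollbackState = null;
        if (acquisitionLockTimeoutTask != null) {
            acquisitionLockTimeoutTask.cancel();
            acquisitionLockTimeoutTask = null;
        }
    }

    // With honorTransition=false this mimics the behaviour described in the ticket.
    boolean tryAcquire(boolean honorTransition) {
        if (state != RecordState.AVAILABLE) return false;
        if (honorTransition && rollbackState != null) return false; // still in transition
        if (acquisitionLockTimeoutTask != null) leakedTask = acquisitionLockTimeoutTask;
        state = RecordState.ACQUIRED;
        acquisitionLockTimeoutTask = new TimeoutTask();
        return true;
    }
}
```

Without the guard, the sketch ends in `ACQUIRED` with no timer and one leaked, never-cancelled timer, which matches issues (a) and (b). With the guard, the acquire is skipped and the batch ends in `AVAILABLE` with nothing leaked.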

[1] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
[2] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
[3] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663