adixitconfluent commented on PR #17965:
URL: https://github.com/apache/kafka/pull/17965#issuecomment-2509650770

   > @adixitconfluent Thanks for the PR, but for my understanding when can this 
scenario happen? We take a lock in the `acquire` which means a single 
thread/client can have access. And acquire method is synchronous update hence 
state will be fully transitioned. How do we encounter this issue?
   
   @apoorvmittal10, the issue can occur while the writeState RPC has not yet 
finished writing updates to the persister for AVAILABLE records, which can 
result in a leaked or null `acquisitionLockTimeoutTask`. Quoting the issue 
described on the ticket for details: 
   
   `SharePartition#acquire` does not honor the rollback state [1][2]. This 
causes two issues.
   
   a. leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a 
new `acquisitionLockTimeoutTask` for the "available" batch; however, the 
available batch in transition already has an `acquisitionLockTimeoutTask`, so 
the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
   
   b. null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be 
reproduced by the following sequence.
   
   1. the batch is in transition - current state is `AVAILABLE` and rollback 
state is `ACQUIRED`
   2. `SharePartition#rollbackOrProcessStateUpdates` is still processing the 
RPC, so it has not yet called `InFlightState#completeStateTransition`
   3. `SharePartition#acquire` assumes the batch is available, so it changes 
the state from `AVAILABLE` to `ACQUIRED` and creates a new 
`acquisitionLockTimeoutTask` (see a.)
   4. `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it 
commits the state and cancels the `acquisitionLockTimeoutTask` - that means the 
batch is in `ACQUIRED` but it does not have an `acquisitionLockTimeoutTask`
   5. the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED` 
but hits an NPE on `acquisitionLockTimeoutTask` [3], and the request then hangs 
until timeout
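
The sequence above can be replayed with a small standalone model. This is only a sketch, not the `SharePartition` code: `AcquireTransitionSketch`, `Batch`, `TimeoutTask`, `acquireUnchecked` and `acquireChecked` are hypothetical names, and the real class uses locks and a persister RPC that are left out here. It assumes that committing a transition cancels and clears whatever timeout task the batch currently holds, as described in the steps above.

```java
public class AcquireTransitionSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    /** Hypothetical stand-in for the acquisition lock timeout task. */
    static final class TimeoutTask {
        private boolean cancelled;
        void cancel() { cancelled = true; }
        boolean isCancelled() { return cancelled; }
    }

    /** Simplified model of one in-flight batch (not Kafka's InFlightState). */
    static final class Batch {
        State state;
        State rollbackState; // non-null while a persister write is pending
        TimeoutTask task;

        Batch(State state, TimeoutTask task) {
            this.state = state;
            this.task = task;
        }

        boolean inTransition() { return rollbackState != null; }

        /** Move to the next state, remembering the old one for rollback. */
        void startTransition(State next) {
            rollbackState = state;
            state = next;
        }

        /** Finish the pending write: commit cancels the current task, failure rolls back. */
        void completeTransition(boolean commit) {
            if (commit) {
                if (task != null) task.cancel();
                task = null;
            } else {
                state = rollbackState;
            }
            rollbackState = null;
        }
    }

    /** Acquire that ignores the rollback state (the behaviour described in the issue). */
    static boolean acquireUnchecked(Batch b) {
        if (b.state != State.AVAILABLE) return false;
        b.state = State.ACQUIRED;
        b.task = new TimeoutTask(); // the previous task is overwritten, never cancelled
        return true;
    }

    /** Acquire that skips batches whose transition has not completed yet. */
    static boolean acquireChecked(Batch b) {
        if (b.inTransition() || b.state != State.AVAILABLE) return false;
        b.state = State.ACQUIRED;
        b.task = new TimeoutTask();
        return true;
    }

    public static void main(String[] args) {
        TimeoutTask original = new TimeoutTask();
        Batch batch = new Batch(State.ACQUIRED, original);
        batch.startTransition(State.AVAILABLE); // step 1: AVAILABLE, rollback ACQUIRED
        acquireUnchecked(batch);                // step 3: acquire during the transition
        batch.completeTransition(true);         // step 4: commit cancels the new task
        System.out.println("state=" + batch.state
                + ", task=" + batch.task
                + ", original task cancelled=" + original.isCancelled());
    }
}
```

In this model the unchecked path ends with the batch in `ACQUIRED` with no task, and the original task never cancelled, which corresponds to issues b. and a. respectively. The checked variant simply declines to acquire until the transition completes; whether the actual fix takes that shape is for the PR diff to show.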
    
   
   [1] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
   
   [2] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
   
   [3] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663


