Chia-Ping Tsai created KAFKA-18084:
--------------------------------------
Summary: Null and leaked AcquisitionLockTimerTask causes hanging
AcknowledgeRequest and corrupted state of batch
Key: KAFKA-18084
URL: https://issues.apache.org/jira/browse/KAFKA-18084
Project: Kafka
Issue Type: Sub-task
Reporter: Chia-Ping Tsai
Assignee: Chia-Ping Tsai
I noticed some critical issues while reading the share-related code:
1)
`SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock while
updating state, so it can result in a race condition. Note that the
`DefaultStatePersister` uses an internal thread [1] to complete those callbacks.
2)
`SharePartition#acquire` does not honor the rollback state [2][3]. This causes
two issues.
2.1) leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a new
`acquisitionLockTimeoutTask` for the "available" batch. However, the available
batch in transition already has an `acquisitionLockTimeoutTask`, so the leaked
`acquisitionLockTimeoutTask` will corrupt the state later ...
2.2) null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be
reproduced by the following sequence.
- the batch is in transition - the current state is `AVAILABLE` and the rollback
state is `ACQUIRED`
- `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, so
it has not yet called `InFlightState#completeStateTransition`
- `SharePartition#acquire` assumes the batch is available, so it changes the
state from `AVAILABLE` to `ACQUIRED` and creates a new
`acquisitionLockTimeoutTask` (see 2.1)
- `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it commits
the state and cancels the `acquisitionLockTimeoutTask` - which means the batch is
in `ACQUIRED` but does not have an `acquisitionLockTimeoutTask`
- the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but
it hits an NPE on `acquisitionLockTimeoutTask` [4] and the request then hangs
until it times out
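To make the interleaving in 2.2 easier to follow, here is a minimal single-threaded sketch that replays the steps in order. It is a simplified model, not the real Kafka code: `AcquireRaceSketch`, `Batch`, `TimerTask`, `startTransition`, `commit` and `replay` are hypothetical names, and the `guarded` flag only illustrates one possible way of honoring the rollback state, which may differ from the fix that is eventually applied.

```java
// Simplified model of the sequence in 2.2. All names are hypothetical stand-ins,
// not the actual SharePartition / InFlightState implementation.
final class AcquireRaceSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    // Stand-in for the acquisition lock timeout task.
    static final class TimerTask {
        boolean cancelled;
        void cancel() { cancelled = true; }
    }

    // Stand-in for the in-flight state of a single batch.
    static final class Batch {
        State state = State.ACQUIRED;
        State rollbackState;                    // non-null while a transition is pending
        TimerTask timerTask = new TimerTask();

        boolean inTransition() { return rollbackState != null; }

        // The state is changed before the persister RPC completes.
        void startTransition(State next) {
            rollbackState = state;
            state = next;
        }

        // RPC completed: commit the transition, then cancel and clear the timer task.
        void commit() {
            rollbackState = null;
            if (timerTask != null) timerTask.cancel();
            timerTask = null;
        }

        // Acquire the batch. When guarded, a batch that is still in transition is skipped.
        boolean acquire(boolean guarded) {
            if (state != State.AVAILABLE) return false;
            if (guarded && inTransition()) return false;
            state = State.ACQUIRED;
            timerTask = new TimerTask();        // previous task is overwritten, not cancelled (2.1)
            return true;
        }

        // Acknowledge assumes an acquired batch always has a timer task.
        void acknowledge() {
            timerTask.cancel();                 // throws NullPointerException if the task was cleared
            state = State.ACKNOWLEDGED;
        }
    }

    // Replays the steps from 2.2 in order and returns the resulting batch.
    static Batch replay(boolean guarded) {
        Batch batch = new Batch();
        batch.startTransition(State.AVAILABLE); // state AVAILABLE, rollback ACQUIRED, RPC pending
        batch.acquire(guarded);                 // acquire arrives while the RPC is in flight
        batch.commit();                         // RPC completes
        return batch;
    }

    public static void main(String[] args) {
        Batch unguarded = replay(false);
        System.out.println("unguarded: state=" + unguarded.state + ", timerTask=" + unguarded.timerTask);
        Batch guarded = replay(true);
        System.out.println("guarded: state=" + guarded.state + ", timerTask=" + guarded.timerTask);
    }
}
```

In this model the unguarded replay ends with the batch in `ACQUIRED` and no timer task, so a later `acknowledge()` fails with an NPE. The guarded replay leaves the batch in `AVAILABLE` with no timer task, which is a consistent state.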
[0]
https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
[1]
https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
[2]
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
[3]
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
[4]
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663