michaeljmarshall opened a new pull request, #18237: URL: https://github.com/apache/pulsar/pull/18237
Fixes: https://github.com/apache/pulsar/issues/18236 ### Motivation See https://github.com/apache/pulsar/issues/18236 for details on the problematic behavior. The fundamental problem is that the cursor persists the `readPosition` instead of persisting the `markDeletePosition` in the `ManagedCursor#internalResetCursor` method. This method is only called on specific occasions, the broker persists the correct `markDeletePosition` in zookeeper, and the bookkeeper's cursor data is only used when the broker doesn't update the cursor in zookeeper, so this bug is rare. I found this bug at first by producing to a newly created topic and subscription where the subscription was set to "latest". In that case, when the broker evaluates the following code, the `position` evaluates to `ledgerId:0` instead of `ledgerId:-1`. It seems that this bug affects all reset cursor calls except "earliest" because of the way "earliest" always goes to entryId -1. https://github.com/apache/pulsar/blob/fa328a42d5770a27237d926eac97385284876174/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java#L1186-L1191 After setting the position incorrectly for `latest` and for regular resets, the method persists `newPosition` (which is defined by `position`) as the `markDeletePosition`: https://github.com/apache/pulsar/blob/fa328a42d5770a27237d926eac97385284876174/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java#L1281 As a result, the cursor's persisted data is incorrect for those two cases. Interestingly, we do not see this bug much because a cleanly closed cursor ledger persists its `markDeletePosition` in [Zookeeper](https://github.com/apache/pulsar/blob/fa328a42d5770a27237d926eac97385284876174/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java#L2525), and when the broker recovers the cursor, it just uses the zookeeper data: https://github.com/apache/pulsar/blob/fa328a42d5770a27237d926eac97385284876174/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java#L458 Note that the only reason the zk metadata is correct is because we update the `markDeletePosition` to a new value in a callback run _after_ persisting a different `markDeletePosition`: https://github.com/apache/pulsar/blob/fa328a42d5770a27237d926eac97385284876174/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java#L1217 Note also that this bug does not occur unless you produce a message to the topic after triggering the `resetCursor` logic. If you do not produce a message, you'll see a log line like `Current position 4:0 is ahead of last position 4:-1`, which was introduced by https://github.com/apache/pulsar/pull/2673. It's possible that PR partially found this bug. Finally, this bug may be related to https://github.com/apache/pulsar/pull/15031 https://github.com/apache/pulsar/pull/15067. ### Modifications * Update `ManagedCursor#internalResetCursor` so that it persists the `markDeletePosition` instead of the `readPosition` to the cursor's bookkeeper ledger. The rest of the changes make sure that the correct variables are shared around. * Improve logging to make it clearer that `resetCursor` updates the `readPosition` * Fix test assertion ordering for semi-related test (this test failed as I iterated over the solution) ### Verifying this change This PR includes two new tests and all current tests continue to pass. ### Does this pull request potentially affect one of the following parts: The only possible "breaking" change is that this PR asserts that the `resetCursor` logic resets the read position (not the mark delete position). I think this is very sensible, but I'll need verification. Note that we already follow this logic in the way that we update the `markDeletePosition` in memory and in zookeeper. This change just updates what gets written to the cursor's ledger. ### Documentation - [x] `doc` ### Matching PR in forked repository PR in forked repository: https://github.com/michaeljmarshall/pulsar/pull/7 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
