merlimat commented on PR #25821:
URL: https://github.com/apache/pulsar/pull/25821#issuecomment-4507495995

   Thanks for the thorough local review — went through all of it. Pushed fixes 
in 8753e3c.
   
   **Confirmed bugs, both fixed:**
   
   1. **Recovery race** — exactly right. A recovery-discovered ABORTED txn was 
only added to `abortedTxns` by the queued `applyTerminalNow` on `stateTail`, 
which can settle after `recoveryFuture` completes, and ABORTED recovery entries 
aren't watermark-pinned — so `isTxnAborted` could return false in that window. 
Now hydrate `abortedTxns` synchronously under the lock in 
`applyHeaderForRecovery`. The race only opens with 2+ terminal txns (the first 
applies synchronously since `stateTail` is complete; subsequent ones queue 
behind its in-flight persist), so the regression test forces that with two txns 
and a hung first `putAbortedTxn`. Verified it fails without the fix.
   
   2. **`persistAbortedRecord` max-position** — agreed, the math goes the wrong 
way. The txn's data sits *above* the watermark, so keying the prune on the 
watermark drops the record too early. Switched the unknown-position fallback to 
the segment LAC (`ledger.getLastConfirmedEntry()`) — a correct conservative 
upper bound since the data was written in a prior epoch. Regression test 
asserts the stored position is the LAC, not the watermark, and fails without 
the fix.
   
   **Quality items:**
   - Dropped the dead `wasRecoveryDiscovered` parameter (the `lastPos != null` 
check is the real gate).
   - Hoisted `abortedTxns.add()` out of the empty `else if` condition.
   - `persistWatermarkIfAdvanced` now self-heals on CAS `BadVersionException` — 
re-reads the watermark so the next retry uses a fresh version instead of 
looping.
   - Merged the two consecutive `synchronized` blocks in 
`syncMaxReadPositionForNormalPublish`.
   - Simplified the redundant `expectedVersion` ternary.
   - Tightened the `isTxnAborted` invariant comment to describe the actual 
`maxReadPosition` cap (lowest open txn's first write in steady state; watermark 
only while recovery-discovered opens remain) rather than implying a 
watermark-only cap.
   - Restored the dropped `isTxnAborted == false` + cross-segment isolation 
assertions in `recoveryDiscoveredOpenTxn_pinsAtWatermark`.
   
   On the broader synchronization refactor (from the inline comment): I'd like 
to keep that as a follow-up — the single-writer approach looks like the best 
fit since the buffer is already driven from per-topic ordered callbacks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to