merlimat opened a new pull request, #25821:
URL: https://github.com/apache/pulsar/pull/25821

   ## Summary
   
   Fixes a latent correctness bug in P3's `MetadataTransactionBuffer` 
([#25768](https://github.com/apache/pulsar/pull/25768)): a TB restart (or new 
`EARLIEST` subscription) on a topic with old committed transactional data 
filtered those committed messages, because the original `/txn/id/<txnId>` 
headers had been GC'd and P3's `isTxnAborted` defaulted "unknown → aborted" 
while cleanup-on-apply also removed the `/txn/op` records. Committed-and-GC'd 
became indistinguishable from orphan.
   
   This PR adds the durable per-segment visibility state that the legacy 
snapshot used to provide, relocated to the metadata store and pruned in step 
with the segment ML's trimming. Combined with a publish-path ordering reversal 
that eliminates orphans by construction, the design no longer needs the 
"unknown → aborted" default.
   
   Design note: `.claude/notes/pip-473-visibility-state.md` in the 
v5-transactions worktree.
   
   ### Publish-path ordering — no orphans
   
   `appendBufferToTxn` writes the `/txn/op` record **before** the ML append:
   
   1. Read header (cache-first). Reject if terminal.
   2. PUT `/txn/op/<txnId>-<seq>`.
   3. ML `asyncAddEntry`.
   4. Ack publisher.
   
   A crash between (2) and (3) leaves an op record with no data — TC times out 
→ ABORTED → participant cleanup deletes the record. There is never a ledger 
entry with no corresponding `/txn/op` record. WRITE op records carry sentinel 
0/0 positions; positions live in the TB's in-memory tracking only.
   
   ### Durable per-segment state
   
   Under `/txn/segment-state/...`:
   
   | Path | Schema |
   |---|---|
   | `/txn/segment-state/watermark/<encoded-segment>` | 
`SegmentWatermark(ledgerId, entryId)` |
   | `/txn/segment-state/aborted/<encoded-segment>:<txnId>` | 
`AbortedTxnRecord(maxLedgerId, maxEntryId)` |
   
   Both carry `partitionKey = segmentKey(segment)`. The aborted records also 
have a `idx:txn-aborted-by-position` secondary-index entry keyed by 
`<encoded-segment>:<padded-position>` — backend-agnostic range trimming.
   
   ### `isTxnAborted` semantics
   
   Flipped to "below watermark + not in aborted set → visible". The aborted set 
is hydrated from durable records at recovery and updated on abort-apply. 
Unknown txns now default visible — this is the bug fix.
   
   ### Recovery & apply ordering
   
   Recovery loads the durable watermark + aborted set, then scans `/txn/op` for 
in-flight open txns at last shutdown. Recovery-discovered opens have unknown 
positions; `maxReadPosition` pins at the watermark until they resolve. **No 
segment-replay needed** — the orphan-elimination invariant makes `/txn/op` an 
authoritative source.
   
   Apply ordering serialises through a `stateTail` `CompletableFuture` chain so 
concurrent applies don't race the watermark CAS: in-memory state change → write 
aborted record (if abort) → CAS watermark forward → delete `/txn/op` records.
   
   ### Test plan
   
   - [x] `pulsar-broker:test --tests MetadataTransactionBufferTest` — 10 cases, 
all green. Includes the bug-fix scenarios:
     - `restartAfterCommit_committedDataStillVisible`
     - `restartAfterAbort_abortedTxnStillFiltered`
     - `recoveryDiscoveredOpenTxn_pinsAtWatermark`
     - `publishOrdering_opRecordWrittenBeforeMlAppend`
   - [x] `pulsar-broker:test --tests TxnMetadataStoreTest` — 7 cases (P2) still 
green after the new segment-state helpers.
   - [x] `pulsar-broker:test --tests TxnIdsTest` — 7 cases.
   - [x] `pulsar-broker:test --tests MetadataPendingAckStoreTest` — 6 cases 
(P4) unaffected.
   - [x] `pulsar-broker:test --tests Dispatching*Test` — 4 routing tests pass.
   - [x] Checkstyle clean on `pulsar-broker`.
   
   ### Deferred / follow-ups
   
   - **ML trim hook for pruning aborted records.** The 
`idx:txn-aborted-by-position` index is in place; wiring the actual 
scan-and-delete to ML trim events is a follow-up. Aborted records accumulate 
until segment teardown for now.
   - **PendingAck-side cursor durability check.** Verify in P4's apply path 
that `handle.commitTxn` / `abortTxn`'s future completes only after the cursor 
write is durable (so we never delete `/txn/op` ACK records before the cursor 
reflects the commit). Separate small verification.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to