merlimat opened a new pull request, #25821: URL: https://github.com/apache/pulsar/pull/25821
## Summary Fixes a latent correctness bug in P3's `MetadataTransactionBuffer` ([#25768](https://github.com/apache/pulsar/pull/25768)): a TB restart (or new `EARLIEST` subscription) on a topic with old committed transactional data filtered those committed messages, because the original `/txn/id/<txnId>` headers had been GC'd and P3's `isTxnAborted` defaulted "unknown → aborted" while cleanup-on-apply also removed the `/txn/op` records. Committed-and-GC'd became indistinguishable from orphan. This PR adds the durable per-segment visibility state that the legacy snapshot used to provide, relocated to the metadata store and pruned in step with the segment ML's trimming. Combined with a publish-path ordering reversal that eliminates orphans by construction, the design no longer needs the "unknown → aborted" default. Design note: `.claude/notes/pip-473-visibility-state.md` in the v5-transactions worktree. ### Publish-path ordering — no orphans `appendBufferToTxn` writes the `/txn/op` record **before** the ML append: 1. Read header (cache-first). Reject if terminal. 2. PUT `/txn/op/<txnId>-<seq>`. 3. ML `asyncAddEntry`. 4. Ack publisher. A crash between (2) and (3) leaves an op record with no data — TC times out → ABORTED → participant cleanup deletes the record. There is never a ledger entry with no corresponding `/txn/op` record. WRITE op records carry sentinel 0/0 positions; positions live in the TB's in-memory tracking only. ### Durable per-segment state Under `/txn/segment-state/...`: | Path | Schema | |---|---| | `/txn/segment-state/watermark/<encoded-segment>` | `SegmentWatermark(ledgerId, entryId)` | | `/txn/segment-state/aborted/<encoded-segment>:<txnId>` | `AbortedTxnRecord(maxLedgerId, maxEntryId)` | Both carry `partitionKey = segmentKey(segment)`. The aborted records also have a `idx:txn-aborted-by-position` secondary-index entry keyed by `<encoded-segment>:<padded-position>` — backend-agnostic range trimming. ### `isTxnAborted` semantics Flipped to "below watermark + not in aborted set → visible". The aborted set is hydrated from durable records at recovery and updated on abort-apply. Unknown txns now default visible — this is the bug fix. ### Recovery & apply ordering Recovery loads the durable watermark + aborted set, then scans `/txn/op` for in-flight open txns at last shutdown. Recovery-discovered opens have unknown positions; `maxReadPosition` pins at the watermark until they resolve. **No segment-replay needed** — the orphan-elimination invariant makes `/txn/op` an authoritative source. Apply ordering serialises through a `stateTail` `CompletableFuture` chain so concurrent applies don't race the watermark CAS: in-memory state change → write aborted record (if abort) → CAS watermark forward → delete `/txn/op` records. ### Test plan - [x] `pulsar-broker:test --tests MetadataTransactionBufferTest` — 10 cases, all green. Includes the bug-fix scenarios: - `restartAfterCommit_committedDataStillVisible` - `restartAfterAbort_abortedTxnStillFiltered` - `recoveryDiscoveredOpenTxn_pinsAtWatermark` - `publishOrdering_opRecordWrittenBeforeMlAppend` - [x] `pulsar-broker:test --tests TxnMetadataStoreTest` — 7 cases (P2) still green after the new segment-state helpers. - [x] `pulsar-broker:test --tests TxnIdsTest` — 7 cases. - [x] `pulsar-broker:test --tests MetadataPendingAckStoreTest` — 6 cases (P4) unaffected. - [x] `pulsar-broker:test --tests Dispatching*Test` — 4 routing tests pass. - [x] Checkstyle clean on `pulsar-broker`. ### Deferred / follow-ups - **ML trim hook for pruning aborted records.** The `idx:txn-aborted-by-position` index is in place; wiring the actual scan-and-delete to ML trim events is a follow-up. Aborted records accumulate until segment teardown for now. - **PendingAck-side cursor durability check.** Verify in P4's apply path that `handle.commitTxn` / `abortTxn`'s future completes only after the cursor write is durable (so we never delete `/txn/op` ACK records before the cursor reflects the commit). Separate small verification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
