Chu Cheng Li created HDDS-15125:
-----------------------------------
Summary: MVCC-Style Snapshot Reclaimability Using seqNumMin /
seqNumMax
Key: HDDS-15125
URL: https://issues.apache.org/jira/browse/HDDS-15125
Project: Apache Ozone
Issue Type: Improvement
Components: OM, Snapshot
Reporter: Chu Cheng Li
h2. Summary
Refactor snapshot key reclamation so newly written deleted-key versions can be
evaluated by MVCC-style visibility intervals instead of opening previous
snapshot RocksDB instances and comparing key/block state.
A key version is visible to a snapshot when:
{code:java}
seqNumMin <= snapshotCreateSeqNum < seqNumMax
{code}
A deleted key version is reclaimable when no active snapshot for that bucket
has a create sequence number inside that interval.
This is a phase-1 optimization for {{deletedTable}} key versions only. Existing
snapshot deletion movement remains unchanged, and legacy entries without
sequence interval metadata continue using the current {{ReclaimableKeyFilter}}
fallback.
h2. Schema And Data Model
||Table||Value Type||Change||
|{{snapshotInfoTable}}|{{SnapshotInfo}}|No new field. Use existing
{{createTransactionInfo}} as the snapshot create sequence source. Do not reuse
deprecated {{{}dbTxSequenceNumber{}}}.|
|{{keyTable}} / {{fileTable}}|{{KeyInfo}} via {{OmKeyInfo}}|Add optional
{{{}uint64 seqNumMin = 24{}}}. Active key versions do not set
{{{}seqNumMax{}}}.|
|{{deletedTable}}|{{RepeatedKeyInfo}} containing {{KeyInfo}}|Store
{{seqNumMin}} and {{seqNumMax}} on each contained {{{}KeyInfo{}}}. Add optional
{{{}uint64 seqNumMax = 25{}}}.|
|{{deletedDirTable}}|{{OmKeyInfo}}|No v1 schema behavior change beyond fields
naturally existing on {{{}KeyInfo{}}}; directory interval optimization is out
of v1.|
|{{snapshotRenamedTable}}|{{String}}|No change in v1. Keep current
previous-snapshot lookup logic for rename entries.|
Add nullable boxed fields to {{OmKeyInfo}} and its builder:
{code:java}
private Long seqNumMin;
private Long seqNumMax;
{code}
Serialize only when non-null. Missing fields mean “not eligible for interval
optimization”.
h2. Fill Rules
Use OM/Ratis transaction index as the sequence number.
For active key versions:
* On new committed key creation, set {{{}seqNumMin =
currentTransactionIndex{}}}.
* On overwrite or data-changing commit, set the new key version’s
{{{}seqNumMin = currentTransactionIndex{}}}.
* On metadata-only operations such as ACL changes, mtime changes, and rename,
preserve existing {{{}seqNumMin{}}}.
* If an existing legacy key has no {{{}seqNumMin{}}}, do not infer one from
{{{}updateID{}}}; keep it missing so deletion falls back to the old logic.
For deleted key versions:
* Centralize deleted-key preparation in
{{{}OmUtils.prepareKeyForDelete(...){}}}.
* Preserve the source key’s {{{}seqNumMin{}}}.
* Set {{{}seqNumMax = currentTransactionIndex{}}}, where the current
transaction is the delete or overwrite transaction that makes this version no
longer visible.
* For pseudo/uncommitted block reclaim records that were never visible to
snapshots, set {{seqNumMin == seqNumMax == currentTransactionIndex}} or keep
the existing unconditional reclaim marker path. The interval must be empty.
* When {{SnapshotDeletingService}} moves deleted entries from a deleted
snapshot to the next active snapshot or AOS, preserve {{seqNumMin}} and
{{seqNumMax}} unchanged.
For snapshots:
* Continue setting {{SnapshotInfo.createTransactionInfo}} in snapshot create.
* Extract the transaction index from {{createTransactionInfo}} for active
snapshot visibility checks.
* If any active snapshot for a bucket lacks {{{}createTransactionInfo{}}},
disable interval optimization for that bucket and use the current fallback.
h2. Reclaim Algorithm
Build an in-memory sorted {{long[]}} of active snapshot create sequence numbers
per snapshot path:
{code:java}
/volume/bucket -> sorted active snapshot create seq nums
{code}
Include only {{SNAPSHOT_ACTIVE}} snapshots. {{SNAPSHOT_DELETED}} snapshots no
longer protect user visibility.
For each deleted key version with both fields present:
{code:java}
int pos = lowerBound(activeSnapshotSeqNums, seqNumMin);
boolean referenced = pos < activeSnapshotSeqNums.length
&& activeSnapshotSeqNums[pos] < seqNumMax;
boolean reclaimable = !referenced;
{code}
If {{seqNumMin}} or {{seqNumMax}} is missing, or the bucket’s active snapshot
sequence array cannot be built exactly, use the current
{{{}ReclaimableKeyFilter{}}}.
Keep the existing snapshot-chain race protection:
* Capture {{expectedPreviousSnapshotId}} before filtering.
* Validate it before submitting purge.
* Include it in the purge request.
* If a snapshot is created between filtering and purge, the purge request must
be rejected or skipped and retried with the new active snapshot sequence array.
h2. Complexity And Sweet Spot
The configured default snapshot limit is currently {{{}10000{}}}, but this
design remains efficient at the discussed maximum of {{{}65535{}}}.
Current key reclaimability path:
{code:java}
O(D * previous snapshot DB lookups + D * block comparison cost)
{code}
Proposed phase-1 path for interval-enabled entries:
{code:java}
O(S) to build or refresh the active snapshot sequence array
O(D * log S) to evaluate deleted key versions
{code}
At {{{}S = 65535{}}}:
{code:java}
log2(S) ~= 16
{code}
So each optimized deleted key version needs about 16 long comparisons and no
previous snapshot RocksDB lookups.
Use sorted primitive arrays and binary search as the sweet spot. Do not
introduce a bitmap or persistent reclaim index in v1:
* A {{long[]}} is exact, simple, and small.
* 65,535 longs is about 512 KB at the extreme.
* A bitmap only works cleanly for dense snapshot ordinals, not sparse OM
transaction indexes.
* A new persistent index adds migration and write-amplification cost that is
not needed for the first optimization.
Snapshot deletion movement remains:
{code:java}
O(M)
{code}
where {{M}} is the number of deleted/renamed/dir entries moved. Eliminating
repeated movement requires a later design that stores deleted-version records
in a global or per-bucket deleted-version table.
h2. Backward Compatibility And Rollout
* Adding optional protobuf fields {{seqNumMin = 24}} and {{seqNumMax = 25}} to
{{KeyInfo}} is wire-compatible.
* Old DB records will not have the fields and must use fallback behavior.
* Do not use {{0}} as a sentinel; rely on protobuf field presence.
* Old OMs may drop unknown fields if they rewrite a key through
{{{}OmKeyInfo{}}}; this is safe because missing fields trigger fallback.
* No in-place migration is required for v1.
* Add metrics for optimized vs fallback reclaimability decisions.
* Gate interval-based filtering behind a layout feature or feature flag so it
can be enabled after upgrade confidence. Field writing is safe, but the
optimization should remain fallback-capable at all times.
h2. Test Plan
* Proto round-trip tests for {{OmKeyInfo}} with present and missing
{{seqNumMin}} / {{{}seqNumMax{}}}.
* Unit tests for key create, overwrite, delete, and metadata-only update fill
rules.
* Unit tests confirming legacy keys without {{seqNumMin}} are not inferred
from {{{}updateID{}}}.
* Reclaim filter tests for interval boundaries: before interval, inside
interval, at {{{}seqNumMax{}}}, empty interval, and no active snapshots.
* Reclaim filter test with 65,535 active snapshot sequence numbers to validate
binary-search behavior and memory assumptions.
* Fallback tests when key interval fields are missing or active snapshot
create sequence is missing.
* Snapshot deletion move tests verifying interval fields are preserved when
deleted entries move to the next active snapshot or AOS.
* Race test where a snapshot is created after filtering but before purge;
purge must be rejected/skipped through existing previous-snapshot validation.
* Integration test with snapshots around create/delete/overwrite operations to
verify user-visible snapshot reads remain correct and only unreferenced blocks
are reclaimed.
h2. Assumptions
* {{seqNumMin}} and {{seqNumMax}} use OM transaction indexes, not RocksDB
sequence numbers. The main distinction is that OM txn id is the {*}logical
visibility clock{*}, while RocksDB sequence number is a {*}local physical
storage clock{*}. Snapshot key reclamation is about whether a user-visible key
version existed at a snapshot create operation, so the logical OM transaction
order is the better fit.
* {{seqNumMax}} is exclusive.
* V1 optimizes deleted key reclamation only.
* Rename entries and directory reclamation continue using existing logic in v1.
* Existing snapshot deletion movement remains in place in v1.
* Missing sequence interval metadata always means fallback to current
behavior, never “safe to reclaim”.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]