Chu Cheng Li created HDDS-15125:
-----------------------------------

             Summary: MVCC-Style Snapshot Reclaimability Using seqNumMin / 
seqNumMax
                 Key: HDDS-15125
                 URL: https://issues.apache.org/jira/browse/HDDS-15125
             Project: Apache Ozone
          Issue Type: Improvement
          Components: OM, Snapshot
            Reporter: Chu Cheng Li


h2. Summary

Refactor snapshot key reclamation so newly written deleted-key versions can be 
evaluated by MVCC-style visibility intervals instead of opening previous 
snapshot RocksDB instances and comparing key/block state.

A key version is visible to a snapshot when:
{code:java}
seqNumMin <= snapshotCreateSeqNum < seqNumMax
{code}
A deleted key version is reclaimable when no active snapshot for that bucket 
has a create sequence number inside that interval.

This is a phase-1 optimization for {{deletedTable}} key versions only. Existing 
snapshot deletion movement remains unchanged, and legacy entries without 
sequence interval metadata continue using the current {{ReclaimableKeyFilter}} 
fallback.
h2. Schema And Data Model
||Table||Value Type||Change||
|{{snapshotInfoTable}}|{{SnapshotInfo}}|No new field. Use existing 
{{createTransactionInfo}} as the snapshot create sequence source. Do not reuse 
deprecated {{{}dbTxSequenceNumber{}}}.|
|{{keyTable}} / {{fileTable}}|{{KeyInfo}} via {{OmKeyInfo}}|Add optional 
{{{}uint64 seqNumMin = 24{}}}. Active key versions do not set 
{{{}seqNumMax{}}}.|
|{{deletedTable}}|{{RepeatedKeyInfo}} containing {{KeyInfo}}|Store 
{{seqNumMin}} and {{seqNumMax}} on each contained {{{}KeyInfo{}}}. Add optional 
{{{}uint64 seqNumMax = 25{}}}.|
|{{deletedDirTable}}|{{OmKeyInfo}}|No v1 schema behavior change beyond fields 
naturally existing on {{{}KeyInfo{}}}; directory interval optimization is out 
of v1.|
|{{snapshotRenamedTable}}|{{String}}|No change in v1. Keep current 
previous-snapshot lookup logic for rename entries.|

Add nullable boxed fields to {{OmKeyInfo}} and its builder:
{code:java}
private Long seqNumMin;
private Long seqNumMax;
{code}
Serialize only when non-null. Missing fields mean “not eligible for interval 
optimization”.
h2. Fill Rules

Use OM/Ratis transaction index as the sequence number.

For active key versions:
 * On new committed key creation, set {{{}seqNumMin = 
currentTransactionIndex{}}}.
 * On overwrite or data-changing commit, set the new key version’s 
{{{}seqNumMin = currentTransactionIndex{}}}.
 * On metadata-only operations such as ACL changes, mtime changes, and rename, 
preserve existing {{{}seqNumMin{}}}.
 * If an existing legacy key has no {{{}seqNumMin{}}}, do not infer one from 
{{{}updateID{}}}; keep it missing so deletion falls back to the old logic.

For deleted key versions:
 * Centralize deleted-key preparation in 
{{{}OmUtils.prepareKeyForDelete(...){}}}.
 * Preserve the source key’s {{{}seqNumMin{}}}.
 * Set {{{}seqNumMax = currentTransactionIndex{}}}, where the current 
transaction is the delete or overwrite transaction that makes this version no 
longer visible.
 * For pseudo/uncommitted block reclaim records that were never visible to 
snapshots, set {{seqNumMin == seqNumMax == currentTransactionIndex}} or keep 
the existing unconditional reclaim marker path. The interval must be empty.
 * When {{SnapshotDeletingService}} moves deleted entries from a deleted 
snapshot to the next active snapshot or AOS, preserve {{seqNumMin}} and 
{{seqNumMax}} unchanged.

For snapshots:
 * Continue setting {{SnapshotInfo.createTransactionInfo}} in snapshot create.
 * Extract the transaction index from {{createTransactionInfo}} for active 
snapshot visibility checks.
 * If any active snapshot for a bucket lacks {{{}createTransactionInfo{}}}, 
disable interval optimization for that bucket and use the current fallback.

h2. Reclaim Algorithm

Build an in-memory sorted {{long[]}} of active snapshot create sequence numbers 
per snapshot path:
{code:java}
/volume/bucket -> sorted active snapshot create seq nums
{code}
Include only {{SNAPSHOT_ACTIVE}} snapshots. {{SNAPSHOT_DELETED}} snapshots no 
longer protect user visibility.

For each deleted key version with both fields present:
{code:java}
int pos = lowerBound(activeSnapshotSeqNums, seqNumMin);
boolean referenced = pos < activeSnapshotSeqNums.length
    && activeSnapshotSeqNums[pos] < seqNumMax;
boolean reclaimable = !referenced;
{code}
If {{seqNumMin}} or {{seqNumMax}} is missing, or the bucket’s active snapshot 
sequence array cannot be built exactly, use the current 
{{{}ReclaimableKeyFilter{}}}.

Keep the existing snapshot-chain race protection:
 * Capture {{expectedPreviousSnapshotId}} before filtering.
 * Validate it before submitting purge.
 * Include it in the purge request.
 * If a snapshot is created between filtering and purge, the purge request must 
be rejected or skipped and retried with the new active snapshot sequence array.

h2. Complexity And Sweet Spot

The configured default snapshot limit is currently {{{}10000{}}}, but this 
design remains efficient at the discussed maximum of {{{}65535{}}}.

Current key reclaimability path:
{code:java}
O(D * previous snapshot DB lookups + D * block comparison cost)
{code}
Proposed phase-1 path for interval-enabled entries:
{code:java}
O(S) to build or refresh the active snapshot sequence array
O(D * log S) to evaluate deleted key versions
{code}
At {{{}S = 65535{}}}:
{code:java}
log2(S) ~= 16
{code}
So each optimized deleted key version needs about 16 long comparisons and no 
previous snapshot RocksDB lookups.

Use sorted primitive arrays and binary search as the sweet spot. Do not 
introduce a bitmap or persistent reclaim index in v1:
 * A {{long[]}} is exact, simple, and small.
 * 65,535 longs is about 512 KB at the extreme.
 * A bitmap only works cleanly for dense snapshot ordinals, not sparse OM 
transaction indexes.
 * A new persistent index adds migration and write-amplification cost that is 
not needed for the first optimization.

Snapshot deletion movement remains:
{code:java}
O(M)
{code}
where {{M}} is the number of deleted/renamed/dir entries moved. Eliminating 
repeated movement requires a later design that stores deleted-version records 
in a global or per-bucket deleted-version table.
h2. Backward Compatibility And Rollout
 * Adding optional protobuf fields {{seqNumMin = 24}} and {{seqNumMax = 25}} to 
{{KeyInfo}} is wire-compatible.
 * Old DB records will not have the fields and must use fallback behavior.
 * Do not use {{0}} as a sentinel; rely on protobuf field presence.
 * Old OMs may drop unknown fields if they rewrite a key through 
{{{}OmKeyInfo{}}}; this is safe because missing fields trigger fallback.
 * No in-place migration is required for v1.
 * Add metrics for optimized vs fallback reclaimability decisions.
 * Gate interval-based filtering behind a layout feature or feature flag so it 
can be enabled after upgrade confidence. Field writing is safe, but the 
optimization should remain fallback-capable at all times.

h2. Test Plan
 * Proto round-trip tests for {{OmKeyInfo}} with present and missing 
{{seqNumMin}} / {{{}seqNumMax{}}}.
 * Unit tests for key create, overwrite, delete, and metadata-only update fill 
rules.
 * Unit tests confirming legacy keys without {{seqNumMin}} are not inferred 
from {{{}updateID{}}}.
 * Reclaim filter tests for interval boundaries: before interval, inside 
interval, at {{{}seqNumMax{}}}, empty interval, and no active snapshots.
 * Reclaim filter test with 65,535 active snapshot sequence numbers to validate 
binary-search behavior and memory assumptions.
 * Fallback tests when key interval fields are missing or active snapshot 
create sequence is missing.
 * Snapshot deletion move tests verifying interval fields are preserved when 
deleted entries move to the next active snapshot or AOS.
 * Race test where a snapshot is created after filtering but before purge; 
purge must be rejected/skipped through existing previous-snapshot validation.
 * Integration test with snapshots around create/delete/overwrite operations to 
verify user-visible snapshot reads remain correct and only unreferenced blocks 
are reclaimed.

h2. Assumptions
 * {{seqNumMin}} and {{seqNumMax}} use OM transaction indexes, not RocksDB 
sequence numbers. The main distinction is that OM txn id is the {*}logical 
visibility clock{*}, while RocksDB sequence number is a {*}local physical 
storage clock{*}. Snapshot key reclamation is about whether a user-visible key 
version existed at a snapshot create operation, so the logical OM transaction 
order is the better fit.
 * {{seqNumMax}} is exclusive.
 * V1 optimizes deleted key reclamation only.
 * Rename entries and directory reclamation continue using existing logic in v1.
 * Existing snapshot deletion movement remains in place in v1.
 * Missing sequence interval metadata always means fallback to current 
behavior, never “safe to reclaim”.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to