[ 
https://issues.apache.org/jira/browse/HDDS-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076672#comment-18076672
 ] 

Wei-Chiu Chuang commented on HDDS-15125:
----------------------------------------

I think this proposal has its own merit: opening/closing RocksDB instances has 
a visible overhead, usually on the order of hundred milliseconds to 0.5 
seconds. Though we cache Snapshot RocksDB instances so the impact shouldn't be 
that terrible.

But IMO the thing is the key deletion happens in the background, so unless it 
becomes so terrible it blocks foreground traffic, it's unlikely to cause 
visible impact. Right now, we are prioritizing the efficient Snapshot Diff 
project which has a great impact to cluster replication SLAs.

> MVCC-Style Snapshot Reclaimability Using seqNumMin / seqNumMax
> --------------------------------------------------------------
>
>                 Key: HDDS-15125
>                 URL: https://issues.apache.org/jira/browse/HDDS-15125
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: OM, Snapshot
>            Reporter: Chu Cheng Li
>            Assignee: Chu Cheng Li
>            Priority: Major
>
> h2. Summary
> Refactor snapshot key reclamation so newly written deleted-key versions can 
> be evaluated by MVCC-style visibility intervals instead of opening previous 
> snapshot RocksDB instances and comparing key/block state.
> A key version is visible to a snapshot when:
> {code:java}
> seqNumMin <= snapshotCreateSeqNum < seqNumMax
> {code}
> A deleted key version is reclaimable when no active snapshot for that bucket 
> has a create sequence number inside that interval.
> This is a phase-1 optimization for {{deletedTable}} key versions only. 
> Existing snapshot deletion movement remains unchanged, and legacy entries 
> without sequence interval metadata continue using the current 
> {{ReclaimableKeyFilter}} fallback.
> h2. Schema And Data Model
> ||Table||Value Type||Change||
> |{{snapshotInfoTable}}|{{SnapshotInfo}}|No new field. Use existing 
> {{createTransactionInfo}} as the snapshot create sequence source. Do not 
> reuse deprecated {{{}dbTxSequenceNumber{}}}.|
> |{{keyTable}} / {{fileTable}}|{{KeyInfo}} via {{OmKeyInfo}}|Add optional 
> {{{}uint64 seqNumMin = 24{}}}. Active key versions do not set 
> {{{}seqNumMax{}}}.|
> |{{deletedTable}}|{{RepeatedKeyInfo}} containing {{KeyInfo}}|Store 
> {{seqNumMin}} and {{seqNumMax}} on each contained {{{}KeyInfo{}}}. Add 
> optional {{{}uint64 seqNumMax = 25{}}}.|
> |{{deletedDirTable}}|{{OmKeyInfo}}|No v1 schema behavior change beyond fields 
> naturally existing on {{{}KeyInfo{}}}; directory interval optimization is out 
> of v1.|
> |{{snapshotRenamedTable}}|{{String}}|No change in v1. Keep current 
> previous-snapshot lookup logic for rename entries.|
> Add nullable boxed fields to {{OmKeyInfo}} and its builder:
> {code:java}
> private Long seqNumMin;
> private Long seqNumMax;
> {code}
> Serialize only when non-null. Missing fields mean “not eligible for interval 
> optimization”.
> h2. Fill Rules
> Use OM/Ratis transaction index as the sequence number.
> For active key versions:
>  * On new committed key creation, set {{{}seqNumMin = 
> currentTransactionIndex{}}}.
>  * On overwrite or data-changing commit, set the new key version’s 
> {{{}seqNumMin = currentTransactionIndex{}}}.
>  * On metadata-only operations such as ACL changes, mtime changes, and 
> rename, preserve existing {{{}seqNumMin{}}}.
>  * If an existing legacy key has no {{{}seqNumMin{}}}, do not infer one from 
> {{{}updateID{}}}; keep it missing so deletion falls back to the old logic.
> For deleted key versions:
>  * Centralize deleted-key preparation in 
> {{{}OmUtils.prepareKeyForDelete(...){}}}.
>  * Preserve the source key’s {{{}seqNumMin{}}}.
>  * Set {{{}seqNumMax = currentTransactionIndex{}}}, where the current 
> transaction is the delete or overwrite transaction that makes this version no 
> longer visible.
>  * For pseudo/uncommitted block reclaim records that were never visible to 
> snapshots, set {{seqNumMin == seqNumMax == currentTransactionIndex}} or keep 
> the existing unconditional reclaim marker path. The interval must be empty.
>  * When {{SnapshotDeletingService}} moves deleted entries from a deleted 
> snapshot to the next active snapshot or AOS, preserve {{seqNumMin}} and 
> {{seqNumMax}} unchanged.
> For snapshots:
>  * Continue setting {{SnapshotInfo.createTransactionInfo}} in snapshot create.
>  * Extract the transaction index from {{createTransactionInfo}} for active 
> snapshot visibility checks.
>  * If any active snapshot for a bucket lacks {{{}createTransactionInfo{}}}, 
> disable interval optimization for that bucket and use the current fallback.
> h2. Reclaim Algorithm
> Build an in-memory sorted {{long[]}} of active snapshot create sequence 
> numbers per snapshot path:
> {code:java}
> /volume/bucket -> sorted active snapshot create seq nums
> {code}
> Include only {{SNAPSHOT_ACTIVE}} snapshots. {{SNAPSHOT_DELETED}} snapshots no 
> longer protect user visibility.
> For each deleted key version with both fields present:
> {code:java}
> int pos = lowerBound(activeSnapshotSeqNums, seqNumMin);
> boolean referenced = pos < activeSnapshotSeqNums.length
>     && activeSnapshotSeqNums[pos] < seqNumMax;
> boolean reclaimable = !referenced;
> {code}
> If {{seqNumMin}} or {{seqNumMax}} is missing, or the bucket’s active snapshot 
> sequence array cannot be built exactly, use the current 
> {{{}ReclaimableKeyFilter{}}}.
> Keep the existing snapshot-chain race protection:
>  * Capture {{expectedPreviousSnapshotId}} before filtering.
>  * Validate it before submitting purge.
>  * Include it in the purge request.
>  * If a snapshot is created between filtering and purge, the purge request 
> must be rejected or skipped and retried with the new active snapshot sequence 
> array.
> h2. Complexity And Sweet Spot
> The configured default snapshot limit is currently {{{}10000{}}}, but this 
> design remains efficient at the discussed maximum of {{{}65535{}}}.
> Current key reclaimability path:
> {code:java}
> O(D * previous snapshot DB lookups + D * block comparison cost)
> {code}
> Proposed phase-1 path for interval-enabled entries:
> {code:java}
> O(S) to build or refresh the active snapshot sequence array
> O(D * log S) to evaluate deleted key versions
> {code}
> At {{{}S = 65535{}}}:
> {code:java}
> log2(S) ~= 16
> {code}
> So each optimized deleted key version needs about 16 long comparisons and no 
> previous snapshot RocksDB lookups.
> Use sorted primitive arrays and binary search as the sweet spot. Do not 
> introduce a bitmap or persistent reclaim index in v1:
>  * A {{long[]}} is exact, simple, and small.
>  * 65,535 longs is about 512 KB at the extreme.
>  * A bitmap only works cleanly for dense snapshot ordinals, not sparse OM 
> transaction indexes.
>  * A new persistent index adds migration and write-amplification cost that is 
> not needed for the first optimization.
> Snapshot deletion movement remains:
> {code:java}
> O(M)
> {code}
> where {{M}} is the number of deleted/renamed/dir entries moved. Eliminating 
> repeated movement requires a later design that stores deleted-version records 
> in a global or per-bucket deleted-version table.
> h2. Backward Compatibility And Rollout
>  * Adding optional protobuf fields {{seqNumMin = 24}} and {{seqNumMax = 25}} 
> to {{KeyInfo}} is wire-compatible.
>  * Old DB records will not have the fields and must use fallback behavior.
>  * Do not use {{0}} as a sentinel; rely on protobuf field presence.
>  * Old OMs may drop unknown fields if they rewrite a key through 
> {{{}OmKeyInfo{}}}; this is safe because missing fields trigger fallback.
>  * No in-place migration is required for v1.
>  * Add metrics for optimized vs fallback reclaimability decisions.
>  * Gate interval-based filtering behind a layout feature or feature flag so 
> it can be enabled after upgrade confidence. Field writing is safe, but the 
> optimization should remain fallback-capable at all times.
> h2. Test Plan
>  * Proto round-trip tests for {{OmKeyInfo}} with present and missing 
> {{seqNumMin}} / {{{}seqNumMax{}}}.
>  * Unit tests for key create, overwrite, delete, and metadata-only update 
> fill rules.
>  * Unit tests confirming legacy keys without {{seqNumMin}} are not inferred 
> from {{{}updateID{}}}.
>  * Reclaim filter tests for interval boundaries: before interval, inside 
> interval, at {{{}seqNumMax{}}}, empty interval, and no active snapshots.
>  * Reclaim filter test with 65,535 active snapshot sequence numbers to 
> validate binary-search behavior and memory assumptions.
>  * Fallback tests when key interval fields are missing or active snapshot 
> create sequence is missing.
>  * Snapshot deletion move tests verifying interval fields are preserved when 
> deleted entries move to the next active snapshot or AOS.
>  * Race test where a snapshot is created after filtering but before purge; 
> purge must be rejected/skipped through existing previous-snapshot validation.
>  * Integration test with snapshots around create/delete/overwrite operations 
> to verify user-visible snapshot reads remain correct and only unreferenced 
> blocks are reclaimed.
> h2. Assumptions
>  * {{seqNumMin}} and {{seqNumMax}} use OM transaction indexes, not RocksDB 
> sequence numbers. The main distinction is that OM txn id is the {*}logical 
> visibility clock{*}, while RocksDB sequence number is a {*}local physical 
> storage clock{*}. Snapshot key reclamation is about whether a user-visible 
> key version existed at a snapshot create operation, so the logical OM 
> transaction order is the better fit.
>  * {{seqNumMax}} is exclusive.
>  * V1 optimizes deleted key reclamation only.
>  * Rename entries and directory reclamation continue using existing logic in 
> v1.
>  * Existing snapshot deletion movement remains in place in v1.
>  * Missing sequence interval metadata always means fallback to current 
> behavior, never “safe to reclaim”.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to