smengcl commented on code in PR #10335: URL: https://github.com/apache/ozone/pull/10335#discussion_r3313879767
########## hadoop-hdds/docs/content/design/efficient-snapdiff.md: ########## @@ -0,0 +1,252 @@ +--- +title: Snapshot Diff Optimization +summary: Describe proposal for an optimized snapshot diff that uses mostly sequential reads and batch puts +date: 2025-05-22 +jira: HDDS-9154 +status: draft +author: Saketa Chalamchala +--- +<!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +## 1. Introduction +This document outlines the technical design, architectural choices, and algorithmic improvements to optimize Ozone's Snapshot Diff feature. The design addresses performance bottlenecks in both the **Full Diff** and **DAG-based Diff** paths. The primary goals are to reduce random I/O, minimize CPU overhead from deserialization, and streamline the classification of differences. + + ## Goals + - Reduce random I/O. + - Minimize CPU cost of deserializing KeyInfo and DirectoryInfo for comparisons. + - Keep baseline diff semantics for CREATE/DELETE/RENAME/MODIFY where possible. + +--- + +## 2. Core Design Choices & Optimizations + +### 2.1. Sequential Reads & Table Iterators +**Baseline Issue:** Baseline full diff enumerates keys via SST readers (plus per-key `db.get` lookups), and the DAG-based diff relies heavily on random point lookups (`db.get()`) against the snapshot RocksDB instances to fetch the old and new states of keys identified in the delta SST files. For buckets with millions of keys, this random I/O degrades performance and thrashes the OS page cache. +**Optimized Design:** The optimization shifts mostly to sequential reads. For the Full Diff path, it uses native RocksDB **Table Iterators** to scan the entire `directoryTable` and `fileTable` sequentially. For the DAG-based path, it uses a **K-way Merge Iterator** over the delta SST files to sequentially extract the latest visible versions without needing to query the main snapshot DBs. This sequential I/O pattern maximizes disk throughput and cache efficiency. + +### 2.2. Lightweight Parsing +**Baseline Issue:** The baseline implementation fully deserializes `OmKeyInfo` and `OmDirectoryInfo` protobuf messages to compare objects, which is extremely CPU and memory intensive when scanning millions of keys. +**Optimized Design:** Introduces a lightweight `SnapshotDiffValueParser` that reads the raw protobuf byte stream directly. It extracts only the required fields (like `updateID`, `parentID`, `name` and compare signature fields) without instantiating full Java objects. It dynamically builds a compare signature by hashing only meaningful fields (content-change: latest block layout, size, `fileChecksum` and metadata-change: ACLs, metadata, tags), skipping volatile fields like `modificationTime` or `creationTime` to identify modified entries. + +#### Pseudo-code: Selective Parsing and Signature +```java +ParsedObjectInfo parseRequiredKeyInfo(byte[] raw, boolean meaningfulOnly) { + ParsedObjectInfo parsed = new ParsedObjectInfo(); + CodedInputStream input = CodedInputStream.newInstance(raw); + while (!input.isAtEnd()) { + int tag = input.readTag(); + switch (WireFormat.getTagFieldNumber(tag)) { + case KEYINFO_OBJECT_ID_FIELD: + parsed.setObjectId(input.readUInt64()); + break; + case KEYINFO_PARENT_ID_FIELD: + parsed.setParentId(input.readUInt64()); + break; + case KEYINFO_KEY_NAME_FIELD: + parsed.setName(input.readString()); + break; + case KEYINFO_UPDATE_ID_FIELD: + parsed.setUpdateId(input.readUInt64()); + break; + default: + input.skipField(tag); + break; + } + } + return parsed; +} + +ParsedObjectInfo parseSignatureKeyInfo(byte[] raw, boolean meaningfulOnly) { + ParsedObjectInfo parsed = new ParsedObjectInfo(); + CodedInputStream input = CodedInputStream.newInstance(raw); + while (!input.isAtEnd()) { + int tag = input.readTag(); + switch (WireFormat.getTagFieldNumber(tag)) { + case KEYINFO_METADATA_FIELD: + case KEYINFO_ACLS_FIELD: + case KEYINFO_TAGS_FIELD: + case KEYINFO_FILE_CHECKSUM_FIELD: + updateSignature(tag, input, parsed); + break; + case KEYINFO_BLOCK_LOCATIONS_FIELD: + updateSignature(extractLatestBlockInfo(tag, input), parsed); + default: + input.skipField(tag); + break; + } + } + return parsed; +} +``` + +### 2.3. UpdateID Gating +**Baseline Issue:** The baseline performs full object comparisons including timestamps to detect modifications, which is susceptible to clock skew and is computationally expensive. +**Optimized Design:** Uses the `dbTxSequenceNumber` of the `fromSnapshot` as a strict gate. During the `toSnapshot` scan, entries are only considered candidates for diff if their `updateID > fromSnapshotDbTxSequenceNumber`. Review Comment: updateId comes from Ratis `trxnLogIndex` , dbTxSequenceNumber is RocksDB concept. You sure those are comparable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
