smengcl commented on code in PR #10335:
URL: https://github.com/apache/ozone/pull/10335#discussion_r3313879767


##########
hadoop-hdds/docs/content/design/efficient-snapdiff.md:
##########
@@ -0,0 +1,252 @@
+---
+title: Snapshot Diff Optimization
+summary: Describe proposal for an optimized snapshot diff that uses mostly 
sequential reads and batch puts
+date: 2025-05-22
+jira: HDDS-9154
+status: draft
+author: Saketa Chalamchala
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+## 1. Introduction
+This document outlines the technical design, architectural choices, and 
algorithmic improvements to optimize Ozone's Snapshot Diff feature. The design 
addresses performance bottlenecks in both the **Full Diff** and **DAG-based 
Diff** paths. The primary goals are to reduce random I/O, minimize CPU overhead 
from deserialization, and streamline the classification of differences.
+
+ ## Goals
+ - Reduce random I/O.
+ - Minimize CPU cost of deserializing KeyInfo and DirectoryInfo for 
comparisons.
+ - Keep baseline diff semantics for CREATE/DELETE/RENAME/MODIFY where possible.
+
+---
+
+## 2. Core Design Choices & Optimizations
+
+### 2.1. Sequential Reads & Table Iterators
+**Baseline Issue:** Baseline full diff enumerates keys via SST readers (plus 
per-key `db.get` lookups), and the DAG-based diff relies heavily on random 
point lookups (`db.get()`) against the snapshot RocksDB instances to fetch the 
old and new states of keys identified in the delta SST files. For buckets with 
millions of keys, this random I/O degrades performance and thrashes the OS page 
cache.
+**Optimized Design:** The optimization shifts mostly to sequential reads. For 
the Full Diff path, it uses native RocksDB **Table Iterators** to scan the 
entire `directoryTable` and `fileTable` sequentially. For the DAG-based path, 
it uses a **K-way Merge Iterator** over the delta SST files to sequentially 
extract the latest visible versions without needing to query the main snapshot 
DBs. This sequential I/O pattern maximizes disk throughput and cache efficiency.
+
+### 2.2. Lightweight Parsing
+**Baseline Issue:** The baseline implementation fully deserializes `OmKeyInfo` 
and `OmDirectoryInfo` protobuf messages to compare objects, which is extremely 
CPU and memory intensive when scanning millions of keys.
+**Optimized Design:** Introduces a lightweight `SnapshotDiffValueParser` that 
reads the raw protobuf byte stream directly. It extracts only the required 
fields (like `updateID`, `parentID`, `name` and compare signature fields) 
without instantiating full Java objects. It dynamically builds a compare 
signature by hashing only meaningful fields (content-change: latest block 
layout, size, `fileChecksum` and metadata-change: ACLs, metadata, tags), 
skipping volatile fields like `modificationTime` or `creationTime` to identify 
modified entries.
+
+#### Pseudo-code: Selective Parsing and Signature
+```java
+ParsedObjectInfo parseRequiredKeyInfo(byte[] raw, boolean meaningfulOnly) {
+  ParsedObjectInfo parsed = new ParsedObjectInfo();
+  CodedInputStream input = CodedInputStream.newInstance(raw);
+  while (!input.isAtEnd()) {
+    int tag = input.readTag();
+    switch (WireFormat.getTagFieldNumber(tag)) {
+    case KEYINFO_OBJECT_ID_FIELD:
+      parsed.setObjectId(input.readUInt64());
+      break;
+    case KEYINFO_PARENT_ID_FIELD:
+      parsed.setParentId(input.readUInt64());
+      break;
+    case KEYINFO_KEY_NAME_FIELD:
+      parsed.setName(input.readString());
+      break;
+    case KEYINFO_UPDATE_ID_FIELD:
+      parsed.setUpdateId(input.readUInt64());
+      break;
+    default:
+      input.skipField(tag);
+      break;
+    }
+  }
+  return parsed;
+}
+
+ParsedObjectInfo parseSignatureKeyInfo(byte[] raw, boolean meaningfulOnly) {
+  ParsedObjectInfo parsed = new ParsedObjectInfo();
+  CodedInputStream input = CodedInputStream.newInstance(raw);
+  while (!input.isAtEnd()) {
+    int tag = input.readTag();
+    switch (WireFormat.getTagFieldNumber(tag)) {
+    case KEYINFO_METADATA_FIELD:
+    case KEYINFO_ACLS_FIELD:
+    case KEYINFO_TAGS_FIELD:
+    case KEYINFO_FILE_CHECKSUM_FIELD:
+        updateSignature(tag, input, parsed);
+      break;
+    case KEYINFO_BLOCK_LOCATIONS_FIELD:
+        updateSignature(extractLatestBlockInfo(tag, input), parsed);
+    default:
+      input.skipField(tag);
+      break;
+    }
+  }
+  return parsed;
+}
+```
+
+### 2.3. UpdateID Gating
+**Baseline Issue:** The baseline performs full object comparisons including 
timestamps to detect modifications, which is susceptible to clock skew and is 
computationally expensive.
+**Optimized Design:** Uses the `dbTxSequenceNumber` of the `fromSnapshot` as a 
strict gate. During the `toSnapshot` scan, entries are only considered 
candidates for diff if their `updateID > fromSnapshotDbTxSequenceNumber`.

Review Comment:
   updateId comes from Ratis `trxnLogIndex` , dbTxSequenceNumber is RocksDB 
concept. You sure those are comparable? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to