devabhishekpal commented on code in PR #9793: URL: https://github.com/apache/ozone/pull/9793#discussion_r2867884325
########## hadoop-hdds/docs/content/design/mpu-gc-optimization.md: ##########
@@ -0,0 +1,649 @@
+---
+title: Multipart Upload GC Pressure Optimizations
+summary: Change Multipart Upload Logic to improve OM GC Pressure
+date: 2026-02-19
+jira: HDDS-10611
+status: proposed
+author: Abhishek Pal, Rakesh Radhakrishnan
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Ozone MPU Optimization - Design Doc
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+    * [Split-table design (V2)](#split-table-design-v2)
+    * [Comparison: V1 (legacy) vs V2](#comparison-v1-legacy-vs-v2)
+    * [2.1 Data Layout Changes](#21-data-layout-changes)
+    * [2.2 MPU Flow Changes](#22-mpu-flow-changes)
+    * [2.3 Summary and Trade-offs](#23-summary-and-trade-offs)
+3. [Upgrades](#3-upgrades)
+4. [Industry Patterns](#4-industry-patterns-flattened-keys-in-lsmrocksdb-systems)
+
+---
+
+## 1. Motivation
+Ozone currently incurs several overheads when uploading large files via Multipart Upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads.
+
+### Problem with the current MPU schema
+**Current design:**
+* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
+* Value = full `OmMultipartKeyInfo` with all parts inline.
+
+**Implications:**
+1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
+<br>
+
+```
+Side note: This is a common pattern in regular open key writes as well, but the MPU case is more severe due to the growing part list and more frequent updates.
+```
+
+2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
+3. GC pressure grows with the size of the object (HDDS-10611).
+
+#### a) Deserialization overhead
+| Operation     | Current                                                 |
+|:--------------|:--------------------------------------------------------|
+| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |
+
+#### b) WAL overhead
+Assuming one MPU part info object takes ~1.5 KB:
+
+| Scenario    | Current WAL                     |
+|:------------|:--------------------------------|
+| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |
+
+#### c) GC pressure
+Current: large short-lived objects per part commit.
+
+#### Existing Storage Layout Overview
+```protobuf
+MultipartKeyInfo {
+  uploadID            : string
+  creationTime        : uint64
+  type                : ReplicationType
+  factor              : ReplicationFactor (optional)
+  partKeyInfoList     : repeated PartKeyInfo   ← grows with each part
+  objectID            : uint64 (optional)
+  updateID            : uint64 (optional)
+  parentID            : uint64 (optional)
+  ecReplicationConfig : optional
+}
+```
+
+---
+
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`: part information will be stored separately in a flattened schema (one row per part) instead of one giant object.
+
+### Split-table design (V2)
+Split MPU metadata into:
+* **Metadata table:** Lightweight per-MPU metadata (no part list).
+* **Parts table:** One row per part (flat structure).
+
+**New MultipartPartInfo Structure:**
+```protobuf
+message MultipartPartInfo {
+  required string partName = 1;
+  required uint32 partNumber = 2;
+  required string volumeName = 3;
+  required string bucketName = 4;
+  required string keyName = 5;
+  required uint64 dataSize = 6;
+  required uint64 modificationTime = 7;
+  repeated KeyLocationList keyLocationList = 8;
+  repeated hadoop.hdds.KeyValue metadata = 9;
+  optional FileEncryptionInfoProto fileEncryptionInfo = 10;
+  optional FileChecksumProto fileChecksum = 11;
+}
+```
+
+### Comparison: V1 (legacy) vs V2
+| Metric              | Current (V1)                  | Split-Table (V2)                                  |
+|:--------------------|:------------------------------|:--------------------------------------------------|
+| **Commit part N**   | Read + deserialize whole list | Read metadata (~200 B) + write single PartKeyInfo |
+| **1,000 parts WAL** | ~733 MB                       | ~1.5 MB (or ~600 KB with optimized info)          |
+| **GC Pressure**     | Large short-lived objects     | Small metadata + single-part objects              |
+
+---
+
+### 2.1 Data Layout Changes
+
+#### 2.1.1 Chosen Approach: Reuse `multipartInfoTable` + add `multipartPartsTable`
+
+Keep `multipartInfoTable` for MPU metadata, and store part rows in `multipartPartsTable`.
+
+**Storage Layout:**
+* **`multipartInfoTable` (RocksDB):**
+    * V1: Key -> `OmMultipartKeyInfo` { parts inline }
+    * V2: Key -> `OmMultipartKeyInfo` { empty list, `schemaVersion: 1` }
+* **`multipartPartsTable` (RocksDB):**
+    * Key type: `OmMultipartPartKey(uploadId, partNumber)`
+    * Value type: `OmMultipartPartInfo`
+
+**`multipartPartsTable` key codec (V2):**
+* `OmMultipartPartKey` uses two logical fields:
+    * `uploadId` (`String`)
+    * `partNumber` (`int32`)
+* Persisted key bytes are encoded as:
+    * `uploadId(UTF-8 bytes)` + `0x00` + `partNumber(4-byte big-endian int)`
+* Prefix scan for all parts in one upload uses:
+    * `uploadId(UTF-8 bytes)` + `0x00`
+
+```text
+Note: The null byte separator ensures that the "uploadId" is properly delimited from the "partNumber" in the byte encoding, allowing for correct lexicographical ordering.
+```

Review Comment:
   So the above approach adds conversion overhead; one way around it that I figured out is to keep the binary 4-byte suffix and decode by checking the full-key layout first (`len - 5`), then the prefix (`len - 1`). Take the sample input below:

   ```
   uploadId   = <abcd-1234...>
   partNumber = 47   -> taking 47 because it has 2f in the byte representation
   ```

   **Encoding format:** keyBytes = `UTF8(uploadId) + '/' + int32_be(partNumber)`

   For 47:
   - int32 big-endian bytes = `00 00 00 2f`

   So the final tail is:
   - separator `/` = `2f`
   - part bytes = `00 00 00 2f`

   Total tail bytes = `2f 00 00 00 2f`
   Readable-ish representation = `<uploadId>/\x00\x00\x00\x2f`

   Now we run the following checks:
   - Check full key first: if the byte at `len - 5` is `/`, decode as a full key (used to identify a full row).
   - Else, if the byte at `len - 1` is `/`, decode as a prefix (used for prefix scans when iterating).
   - Else, the key is invalid.

   This also skips the extra hops of converting `byte -> hex -> string`. The assumption here is that the key is long enough that `len - 5` doesn't break.
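   The layout checks above can be sketched as follows. This is a minimal Python illustration under the stated assumptions (uploadId contains no `/`), not Ozone's actual codec; `encode_full_key`, `encode_prefix`, and `decode` are hypothetical names:

   ```python
   import struct

   SEP = ord('/')  # 0x2f

   def encode_full_key(upload_id: str, part_number: int) -> bytes:
       # uploadId(UTF-8) + '/' + 4-byte big-endian part number
       return upload_id.encode('utf-8') + b'/' + struct.pack('>i', part_number)

   def encode_prefix(upload_id: str) -> bytes:
       # Prefix used when scanning all parts of one upload
       return upload_id.encode('utf-8') + b'/'

   def decode(key: bytes):
       # Full-key check first: the separator always sits 5 bytes from the
       # end ('/' + 4 part-number bytes), even when the part-number bytes
       # themselves contain 0x2f (as with partNumber 47).
       if len(key) >= 6 and key[-5] == SEP:
           return key[:-5].decode('utf-8'), struct.unpack('>i', key[-4:])[0]
       # Prefix check: the key ends with the separator.
       if len(key) >= 2 and key[-1] == SEP:
           return key[:-1].decode('utf-8'), None
       raise ValueError("malformed multipart part key")
   ```

   For partNumber 47 the tail bytes are `2f 00 00 00 2f`; the full-key check still resolves it correctly because the separator position is fixed at `len - 5` by construction, regardless of which part-number bytes happen to equal `0x2f`.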
I **asked ChatGPT** for edge cases where this can break, and it says that in normal operation this should not break. Below is its explanation:

#### Why valid keys always have enough length
For a full key (binary-tail design):
- format is uploadId + '/' + 4 bytes
- minimum valid size = 1 + 1 + 4 = 6 bytes (even with an uploadId of length 1)

For a prefix key:
- format is uploadId + '/'
- minimum valid size = 1 + 1 = 2 bytes

So any legit key generated by toPersistedFormat is long enough.

#### When it can be too short
Only if the decoder sees invalid raw data, for example:
- an empty byte array
- truncated/corrupted key bytes in the DB or a scan tool
- the wrong table decoded with the wrong codec
- manual/debug insertion of a malformed key

So the length check is not needed for normal runtime correctness; it's there for robustness and clean failure on bad bytes.

--
This is an automated message from the Apache Git Service.
