mango-li commented on code in PR #9793: URL: https://github.com/apache/ozone/pull/9793#discussion_r2958457358
########## hadoop-hdds/docs/content/design/mpu-gc-optimization.md: ##########
@@ -0,0 +1,668 @@
---
title: Multipart Upload GC Pressure Optimizations
summary: Change Multipart Upload Logic to improve OM GC Pressure
date: 2026-02-19
jira: HDDS-10611
status: proposed
author: Abhishek Pal, Rakesh Radhakrishnan
---
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# Ozone MPU Optimization - Design Doc


## Table of Contents
1. [Motivation](#1-motivation)
2. [Proposal](#2-proposal)
   * [Split-table design (V2)](#split-table-design-v2)
   * [Comparison: V1 (legacy) vs V2](#comparison-v1-legacy-vs-v2)
   * [2.1 Data Layout Changes](#21-data-layout-changes)
   * [2.2 MPU Flow Changes](#22-mpu-flow-changes)
   * [2.3 Summary and Trade-offs](#23-summary-and-trade-offs)
3. [Upgrades](#3-upgrades)
4. [Industry Patterns](#4-industry-patterns-flattened-keys-in-lsmrocksdb-systems)
---

## 1. Motivation
Ozone currently incurs several overheads when uploading large files via Multipart Upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads.

### Problem with the current MPU schema
**Current design:**
* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
* Value = full `OmMultipartKeyInfo` with all parts inline.

**Implications:**
1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
<br>

```
Side note: This is a common pattern in regular open key writes as well, but the MPU case is more severe due to the growing part list and more frequent updates.
```
2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
3. GC pressure grows with the size of the object (HDDS-10611).

#### a) Deserialization overhead
| Operation     | Current                                                 |
|:--------------|:--------------------------------------------------------|
| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |

#### b) WAL overhead
Assuming one MPU part info object takes ~1.5KB.

| Scenario    | Current WAL                     |
|:------------|:--------------------------------|
| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |

#### c) GC pressure
Current: Large short-lived objects per part commit.

#### Existing Storage Layout Overview
```protobuf
MultipartKeyInfo {
  uploadID            : string
  creationTime        : uint64
  type                : ReplicationType
  factor              : ReplicationFactor (optional)
  partKeyInfoList     : repeated PartKeyInfo   ← grows with each part
  objectID            : uint64 (optional)
  updateID            : uint64 (optional)
  parentID            : uint64 (optional)
  ecReplicationConfig : optional
}
```

---

## 2. Proposal
The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object.

### Split-table design (V2)
Split MPU metadata into:
* **Metadata table:** Lightweight per-MPU metadata (no part list).
* **Parts table:** One row per part (flat structure).
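As a sanity check on the WAL arithmetic above, the following toy model (plain Python, not Ozone code; `wal_bytes_v1`/`wal_bytes_v2` and the ~200 B metadata figure are invented for this sketch, and ~1.5 KB per part is the same assumption used in the WAL table) contrasts the quadratic WAL growth of the inline V1 layout with the linear growth of the split V2 layout:

```python
# Toy model of WAL bytes written per layout; ~1.5 KB per serialized part.
# Illustrative only -- not the actual OM serialization sizes.
PART_BYTES = 1536  # ~1.5 KB

def wal_bytes_v1(num_parts: int) -> int:
    # V1: committing part N rewrites the whole OmMultipartKeyInfo,
    # so the WAL logs 1 + 2 + ... + N parts' worth of data in total.
    return sum(n * PART_BYTES for n in range(1, num_parts + 1))

def wal_bytes_v2(num_parts: int) -> int:
    # V2: each commit writes one flat part row plus a small (~200 B,
    # assumed) metadata update, so WAL growth is linear in num_parts.
    METADATA_BYTES = 200
    return num_parts * (PART_BYTES + METADATA_BYTES)

v1 = wal_bytes_v1(1000)
v2 = wal_bytes_v2(1000)
print(v1 // (1024 * 1024), "MiB vs", v2 // 1024, "KiB")  # prints: 733 MiB vs 1695 KiB
```

The 733 MiB figure reproduces the "~733 MB" entry in the table above; the V2 total stays under ~2 MB even for the 1,000-part worst case.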

**New MultipartPartInfo Structure:**
```protobuf
message MultipartPartInfo {
  optional string partName = 1;
  optional uint32 partNumber = 2;
  optional string eTag = 3;
  optional KeyLocationList keyLocationList = 4;
  optional uint64 dataSize = 5;
  optional uint64 modificationTime = 6;
  optional uint64 objectID = 7;
  optional uint64 updateID = 8;
  optional FileEncryptionInfoProto fileEncryptionInfo = 9;
  optional FileChecksumProto fileChecksum = 10;
}
```

```
Note: All fields are declared optional because the Protobuf guidance is that required-field semantics should be enforced at the application level; proto3 also no longer supports required fields.
```

### Comparison: V1 (legacy) vs V2
| Metric              | Current (V1)                  | Split-Table (V2)                                 |
|:--------------------|:------------------------------|:-------------------------------------------------|
| **Commit part N**   | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo |
| **1,000 parts WAL** | ~733 MB                       | ~1.5 MB (or ~600KB with optimized info)          |
| **GC Pressure**     | Large short-lived objects     | Small metadata + single-part objects             |

---

### 2.1 Data Layout Changes

#### 2.1.1 Chosen Approach: Reuse `multipartInfoTable` + add `multipartPartsTable`

Keep `multipartInfoTable` for MPU metadata, and store part rows in `multipartPartsTable`.

**Storage Layout:**
* **`multipartInfoTable` (RocksDB):**
  * V1: Key -> `OmMultipartKeyInfo` { parts inline }
  * V2: Key -> `OmMultipartKeyInfo` { empty list, `schemaVersion: 1` }
* **`multipartPartsTable` (RocksDB):**
  * Key type: `OmMultipartPartKey(uploadId, partNumber)`
  * Value type: `OmMultipartPartInfo`

**`multipartPartsTable` key codec (V2):**
* `OmMultipartPartKey` uses two logical fields:
  * `uploadId` (`String`)
  * `partNumber` (`int32`)
* Persisted key bytes are encoded as:
  * `uploadId(UTF-8 bytes)` + `'/' (0x2f)` + `partNumber(4-byte big-endian int)`
* Prefix scan for all parts in one upload uses:
  * `uploadId(UTF-8 bytes)` + `'/' (0x2f)`

```text
`OmMultipartPartKey.toString()` returns:
  - full key:   "<uploadId>/<partNumber>"
  - prefix key: "<uploadId>" (used only as an in-memory prefix object)

Example:
  OmMultipartPartKey.of("abc123-uuid-456", 2).toString() == "abc123-uuid-456/2"
```

Encoded keys sort byte-wise by uploadId and then, because the part number is a fixed-width big-endian integer, numerically by part number. This matches the ascending part-number ordering that the S3 specification requires for ListParts and ListMultipartUploads responses.
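The key codec described above can be sketched in a few lines (plain Python standing in for the actual `OmMultipartPartKey` codec, so the function names are illustrative): the big-endian `int32` suffix is what makes plain byte-wise sorting, as RocksDB does, return parts in numeric order rather than the decimal-string order where `"10"` would sort before `"2"`.

```python
def encode_part_key(upload_id: str, part_number: int) -> bytes:
    # uploadId UTF-8 bytes + '/' (0x2f) + 4-byte big-endian part number
    return upload_id.encode("utf-8") + b"/" + part_number.to_bytes(4, "big")

def part_prefix(upload_id: str) -> bytes:
    # Prefix used to scan all parts of one upload
    return upload_id.encode("utf-8") + b"/"

keys = [encode_part_key("abc123-uuid-456", n) for n in (10, 2, 1)]
keys.sort()  # RocksDB iterates keys in byte-wise order

# Big-endian int32 preserves numeric order: part 2 sorts before part 10.
order = [int.from_bytes(k[-4:], "big") for k in keys]
print(order)  # [1, 2, 10]
assert all(k.startswith(part_prefix("abc123-uuid-456")) for k in keys)
```

Encoding part 2 this way reproduces the `... 2f 00 00 00 02` byte layout shown in the encoded key sample below.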

#### MultipartKeyInfo Structure
```protobuf
message MultipartKeyInfo {
  required string uploadID = 1;
  required uint64 creationTime = 2;
  required hadoop.hdds.ReplicationType type = 3;
  optional hadoop.hdds.ReplicationFactor factor = 4;
  repeated PartKeyInfo partKeyInfoList = 5 [deprecated = true];
  optional uint64 objectID = 6;
  optional uint64 updateID = 7;
  optional uint64 parentID = 8;
  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 9;
  optional uint32 schemaVersion = 10; // default 0
  // These fields are pulled up from the part information, as they will not change per part for a given key.
  optional string volumeName = 11;
  optional string bucketName = 12;
  optional string keyName = 13;
  optional string ownerName = 14;
  repeated OzoneAclInfo acls = 15;
}
```

##### V1: `OmMultipartKeyInfo` (parts inline)
```
OmMultipartKeyInfo {
  uploadID
  creationTime
  type
  factor
  partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ]   <- all parts inline
  objectID
  updateID
  parentID
  schemaVersion: 0 (or absent)
}
```

##### V2: `OmMultipartKeyInfo` (empty list + schemaVersion)
```
OmMultipartKeyInfo {
  uploadID
  creationTime
  type
  factor
  partKeyInfoList: []   <- empty
  objectID
  updateID
  parentID
  schemaVersion: 1
}
```

##### Example (for a 10-part MPU)

`multipartInfoTable`:
```
Key: /vol1/bucket1/mp_file1/abc123-uuid-456

Value:
OmMultipartKeyInfo {
  uploadID: "abc123-uuid-456"
  creationTime: 1738742400000
  type: RATIS
  factor: THREE
  partKeyInfoList: []
  objectID: 1001
  updateID: 12345
  parentID: 0
  schemaVersion: 1
}
```

`multipartPartsTable` (logical keys):
```text
Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=1}
  String form: "abc123-uuid-456/1"
Value: OmMultipartPartInfo{partNumber=1, partName=".../part1", ...}

Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=2}
  String form: "abc123-uuid-456/2"
Value: OmMultipartPartInfo{partNumber=2, 
partName=".../part2", ...}
...
Key: OmMultipartPartKey{uploadId="abc123-uuid-456", partNumber=10}
  String form: "abc123-uuid-456/10"
Value: OmMultipartPartInfo{partNumber=10, partName=".../part10", ...}
```

`multipartPartsTable` (encoded key sample):
```text
uploadId   = "abc123-uuid-456"
partNumber = 2

encodedKey = [61 62 63 31 32 33 2d 75 75 69 64 2d 34 35 36 2f 00 00 00 02]
             [--------------uploadId UTF-8---------------][2f][-int32 BE-]
```

#### 2.1.2 Alternative Approach: Add `multipartMetadataTable` + `multipartPartsTable`

Split metadata and introduce two new tables:
* **`multipartMetadataTable`**: lightweight per-MPU metadata (no part list).
* **`multipartPartsTable`**: one row per part (no aggregation).

```protobuf
message MultipartMetadataInfo {
  required string uploadID = 1;
  required uint64 creationTime = 2;
  required hadoop.hdds.ReplicationType type = 3;
  optional hadoop.hdds.ReplicationFactor factor = 4;
  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 5;
  optional uint64 objectID = 6;
  optional uint64 updateID = 7;
  optional uint64 parentID = 8;
  optional uint32 schemaVersion = 9; // default 0
}
```

**Storage Layout Overview:**
* **`multipartInfoTable` (RocksDB):**
  * V1: `/vol/bucket/key/uploadId` -> `OmMultipartKeyInfo { partKeyInfoList: [...] 
}`
* **`multipartMetadataTable` (RocksDB):**
  * V2: `/vol/bucket/key/uploadId` -> `MultipartMetadata { schemaVersion: 1 }`
* **`multipartPartsTable` (RocksDB):**
  * Key: `OmMultipartPartKey(uploadId, partNumber)`
  * Value: `PartKeyInfo`-equivalent part payload

```protobuf
message MultipartMetadata {
  required string uploadID = 1;
  required uint64 creationTime = 2;
  required hadoop.hdds.ReplicationType type = 3;
  optional hadoop.hdds.ReplicationFactor factor = 4;
  optional uint64 objectID = 5;
  optional uint64 updateID = 6;
  optional uint64 parentID = 7;
  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 8;
  optional uint32 schemaVersion = 9;
  // NO partKeyInfoList - moved to new table
}
```

Example:
```
Key: /vol1/bucket1/mp_file1/abc123-uuid-456

Value:
MultipartMetadata {
  uploadID: "abc123-uuid-456"
  creationTime: 1738742400000
  type: RATIS
  factor: THREE
  objectID: 1001
  updateID: 12345
  parentID: 0
  schemaVersion: 1
}
```

### 2.2 MPU Flow Changes

#### 2.2.1 Chosen Approach Flow Changes

##### Multipart Upload Initiate

**Old Flow**
* Create `multipartKey = /{vol}/{bucket}/{key}/{uploadId}`.
* Build `OmMultipartKeyInfo` (schema default/legacy, inline `partKeyInfoList` model).
* Write:
  * `openKeyTable[multipartKey] = OmKeyInfo`
  * `multipartInfoTable[multipartKey] = OmMultipartKeyInfo`

Example:
```text
multipartInfoTable[/vol1/b1/fileA/upload-001] ->
  OmMultipartKeyInfo{schemaVersion=0, partKeyInfoList=[]}
openKeyTable[/vol1/b1/fileA/upload-001] ->
  OmKeyInfo{key=fileA, objectID=9001}
```

**New Flow**
* Same keys/tables as the old flow, but initiate sets `schemaVersion` explicitly:
  * `schemaVersion=1` when `OMLayoutFeature.MPU_PARTS_TABLE_SPLIT` is allowed.
  * `schemaVersion=0` otherwise.
* No part row is created at initiate time; part rows are created during commit-part.
* FSO response path (`S3InitiateMultipartUploadResponseWithFSO`) still writes parent directory entries, then open-file + multipart-info rows.
* Backward compatibility: write path selection is schema-based and layout-gated (see [3.1 Backward compatibility and layout gating](#31-backward-compatibility-and-layout-gating)).

Example:
```text
multipartInfoTable[/vol1/b1/fileA/upload-001] ->
  OmMultipartKeyInfo{schemaVersion=1, partKeyInfoList=[]}
```

##### Multipart Upload Commit Part

**Old Flow**
* Read `multipartInfoTable[multipartKey]`.
* Read current uploaded part blocks from `openKeyTable[getOpenKey(..., clientID)]`.
* Insert in inline map:
  * `oldPart = multipartKeyInfo.getPartKeyInfo(partNumber)`
  * `multipartKeyInfo.addPartKeyInfo(currentPart)`
* Delete committed one-shot open key for this part.
* Update quota based on overwrite delta.

Example:
```text
Before: partKeyInfoList=[{part=1,size=64MB},{part=2,size=32MB}]
Commit part 2 size=40MB
After:  partKeyInfoList=[{part=1,size=64MB},{part=2,size=40MB}]
```

**New Flow**
* Load `multipartKeyInfo` and validate the layout gate:
  * if the split feature is not allowed and `schemaVersion != 0`, fail early.
* Branch by schema:
  * `schemaVersion=0`: same old inline behavior.
  * `schemaVersion=1`:
    * create `multipartPartKey = OmMultipartPartKey(uploadId, partNumber)`,
    * write `multipartPartsTable[multipartPartKey] = OmMultipartPartInfo{openKey, partName, partNumber, dataSize, modificationTime, objectID, updateID, metadata, keyLocationList, fileEncryptionInfo?, fileChecksum?}`,
    * keep the current part open key in `openKeyTable` (needed later by list/complete/abort),
    * if overwriting an existing part row, delete the old part open key and adjust quota.
* `multipartInfoTable[multipartKey]` is still updated for metadata/updateID.
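The schema branch in the new commit-part flow can be sketched with a toy in-memory model (plain Python; `commit_part` and the dict "tables" are invented stand-ins for the OM request handler and RocksDB tables, and quota/open-key handling is omitted):

```python
# Illustrative in-memory model of the two commit-part paths.
multipart_info_table = {}   # mpuKey -> {"schemaVersion": int, "parts": dict, "updateID": int}
multipart_parts_table = {}  # (uploadId, partNumber) -> part row

def commit_part(mpu_key, upload_id, part_number, part_row):
    info = multipart_info_table[mpu_key]
    if info["schemaVersion"] == 0:
        # V1: read-modify-write of the whole inline part list
        info["parts"][part_number] = part_row
    else:
        # V2: one flat row per part; committing the same part number
        # again simply replaces the old row
        multipart_parts_table[(upload_id, part_number)] = part_row
    info["updateID"] += 1  # the lightweight metadata row is still updated

multipart_info_table["/vol1/b1/fileA/upload-001"] = {
    "schemaVersion": 1, "parts": {}, "updateID": 0}
commit_part("/vol1/b1/fileA/upload-001", "upload-001", 1, {"size": 64})
commit_part("/vol1/b1/fileA/upload-001", "upload-001", 2, {"size": 32})
commit_part("/vol1/b1/fileA/upload-001", "upload-001", 2, {"size": 40})  # overwrite

print(len(multipart_parts_table))                 # prints: 2
print(multipart_parts_table[("upload-001", 2)])   # prints: {'size': 40}
```

Note that under V2 the inline `parts` map stays empty: the per-MPU row only absorbs small metadata updates, which is exactly where the GC and WAL savings come from.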
Review Comment:
   If updating the multipartInfoTable is still necessary here, can we retain some information from the partKeyInfoList, such as the part name and part number, but remove the partKeyInfo? This way, future requests can avoid scanning the multipartPartTable and only need to query a single key.
