Tejaskriya commented on code in PR #8178: URL: https://github.com/apache/ozone/pull/8178#discussion_r2032513894
########## hadoop-hdds/docs/content/design/aggressive-db-compaction-with-minimal-degradation.md: ##########
@@ -0,0 +1,225 @@
+---
+title: Aggressive DB Compaction with Minimal Degradation
+summary: Automatically compactRange on RocksDB with statistics of SST File
+date: 2025-03-27
+jira: HDDS-12682
+status: accepted
+author: Peter Lee
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# Aggressive DB Compaction with Minimal Degradation
+
+## Short Introduction
+
+Use the `numEntries` and `numDeletion` statistics in [TableProperties](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12), which RocksDB stores for each SST file, as guidance for splitting tables into finer ranges for compaction.
+
+## Motivation
+
+Our current approach of compacting entire column families directly would significantly impact online performance through excessive write amplification. After researching TiKV and RocksDB compaction mechanisms, it is clear we need a more sophisticated solution that better balances maintenance operations with user workloads.
+
+TiKV runs background tasks for compaction and logically splits key ranges into table regions (with a default size limit of 256MB per region), allowing gradual scanning and compaction of known ranges. While we can use the built-in `TableProperties` in SST files to check metrics like `num_entries` and `num_deletion`, these only represent operation counts without deduplicating keys. TiKV addresses this with a custom `MVCTablePropertiesCollector` for more accurate results, but unfortunately the Java API doesn't currently support custom collectors, forcing us to rely on built-in statistics.
+
+For the Ozone Manager implementation, we face a different challenge since OM lacks the concept of size-based key range splits. The most logical division we can use is the bucket prefix (file table). For FSO buckets, we can further divide key ranges based on directory `parent_id`, enabling more granular and targeted compaction that minimizes disruption to ongoing operations.
+
+By implementing bucket-level compaction with proper paging mechanisms like `next_bucket` and potentially `next_parent_id` for directory-related tables, we can achieve more efficient storage utilization while maintaining performance. The Java APIs currently provide enough support to implement these ideas, making this approach viable for Ozone Manager.
+
+## Proposed Changes
+
+### RocksDB Java API Used
+
+- [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final ColumnFamilyHandle columnFamilyHandle, final List<Range> ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
+  - Given a list of `Range`s, returns a map of the `TableProperties` of the SST files in these ranges.
+- [TableProperties](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
+  - Statistical data for one SST file.
+- [Range](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)
+  - Contains one start [slice](https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/Slice.html) and one end slice.
+
+### New Configuration Set
+
+Introduce four new configuration keys:
+- `bucket_compact_check_interval`: Interval (ms) at which to check whether to start compaction for a region.
+- `bucket_compact_max_entries_sum`: Upper bound of the `num_entries` sum across all SST files in one compaction range. Default value is 1000000.
+- `bucket_compact_tombstone_percentage`: Only compact a range when `num_entries * tombstone_percentage / 100 <= num_deletion`. Default value is 30.
+- `bucket_compact_min_tombstones`: Minimum number of tombstones to trigger manual compaction. Default value is 10000.
+
+### Create Compactor For Each Table
+
+Create new compactor instances for each table, including `KEY_TABLE`, `DELETED_TABLE`, `DELETED_DIR_TABLE`, `DIRECTORY_TABLE`, and `FILE_TABLE`. Run these background workers using a scheduled executor with the configured interval and a random start time to spread out the workload.

Review Comment:
   We can also include the multipartInfoTable here; for anyone using multipart uploads, this table could also cross the numEntries and numDeletes thresholds.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
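As a rough sketch of the trigger logic the proposed configuration implies (class and method names here are illustrative, not from the PR): a compactor would sum `getNumEntries()` and `getNumDeletions()` from the `TableProperties` returned by `RocksDB#getPropertiesOfTablesInRange` for a candidate range, then apply the three thresholds before calling `compactRange`.

```java
// Hypothetical helper combining the three thresholds from the design doc:
// bucket_compact_max_entries_sum, bucket_compact_tombstone_percentage,
// and bucket_compact_min_tombstones. Only the threshold arithmetic is
// shown; wiring to RocksDB#getPropertiesOfTablesInRange is sketched in
// the Javadoc comment.
class CompactionThresholds {
    private final long maxEntriesSum;      // bucket_compact_max_entries_sum
    private final int tombstonePercentage; // bucket_compact_tombstone_percentage
    private final long minTombstones;      // bucket_compact_min_tombstones

    CompactionThresholds(long maxEntriesSum, int tombstonePercentage, long minTombstones) {
        this.maxEntriesSum = maxEntriesSum;
        this.tombstonePercentage = tombstonePercentage;
        this.minTombstones = minTombstones;
    }

    /**
     * Decide whether a key range should be manually compacted. The two sums
     * would be computed by adding up TableProperties#getNumEntries and
     * TableProperties#getNumDeletions over the map returned by
     * RocksDB#getPropertiesOfTablesInRange for that range.
     */
    boolean shouldCompact(long numEntriesSum, long numDeletionsSum) {
        return numEntriesSum <= maxEntriesSum          // range small enough to compact
            && numDeletionsSum >= minTombstones        // enough tombstones in absolute terms
            && numEntriesSum * tombstonePercentage / 100 <= numDeletionsSum; // tombstone ratio
    }

    public static void main(String[] args) {
        // Defaults from the design doc: 1000000 entries, 30%, 10000 tombstones.
        CompactionThresholds t = new CompactionThresholds(1_000_000L, 30, 10_000L);
        System.out.println(t.shouldCompact(100_000L, 40_000L)); // 40% tombstones: true
        System.out.println(t.shouldCompact(100_000L, 20_000L)); // below 30% ratio: false
        System.out.println(t.shouldCompact(20_000L, 9_000L));   // below min tombstones: false
    }
}
```

A range passing this check would then be compacted with `RocksDB#compactRange(ColumnFamilyHandle, byte[], byte[])` using the same start/end keys that were wrapped in `Range` slices for the statistics query.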
