Tejaskriya commented on code in PR #8178: URL: https://github.com/apache/ozone/pull/8178#discussion_r2032513894
########## hadoop-hdds/docs/content/design/aggressive-db-compaction-with-minimal-degradation.md: ##########
@@ -0,0 +1,225 @@
+---
+title: Aggressive DB Compaction with Minimal Degradation
+summary: Automatically compactRange on RocksDB with statistics of SST File
+date: 2025-03-27
+jira: HDDS-12682
+status: accepted
+author: Peter Lee
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# Aggressive DB Compaction with Minimal Degradation
+
+## Short Introduction
+
+Use the `numEntries` and `numDeletion` statistics in [TableProperties](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12), which RocksDB stores for each SST file, as guidance for splitting tables into finer ranges for compaction.
+
+## Motivation
+
+Our current approach of compacting entire column families directly would significantly impact online performance through excessive write amplification. After researching TiKV and RocksDB compaction mechanisms, it is clear we need a more sophisticated solution that better balances maintenance operations with user workloads.
+
+TiKV runs background tasks for compaction and logically splits key ranges into table regions (with a default size limit of 256MB per region), allowing gradual scanning and compaction of known ranges. While we can use the built-in `TableProperties` in SST files to check metrics like `num_entries` and `num_deletion`, these only represent operation counts without deduplicating keys. TiKV addresses this with a custom `MVCTablePropertiesCollector` for more accurate results, but unfortunately the Java API doesn't currently support custom collectors, forcing us to rely on built-in statistics.
+
+For the Ozone Manager implementation, we face a different challenge since OM lacks the concept of size-based key range splits. The most logical division we can use is the bucket prefix (file table). For FSO buckets, we can further divide key ranges based on directory `parent_id`, enabling more granular and targeted compaction that minimizes disruption to ongoing operations.
+
+By implementing bucket-level compaction with proper paging mechanisms like `next_bucket` and potentially `next_parent_id` for directory-related tables, we can achieve more efficient storage utilization while maintaining performance. The Java APIs currently provide enough support to implement these ideas, making this approach viable for Ozone Manager.
+
+## Proposed Changes
+
+### RocksDB Java API Used
+
+- [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final ColumnFamilyHandle columnFamilyHandle, final List<Range> ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
+  - Given a list of `Range`s, returns a map of the `TableProperties` of the SST files in these ranges.
+- [TableProperties](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
+  - Statistical data for one SST file.
+- [Range](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)
+  - Contains one start [slice](https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/Slice.html) and one end slice.
+
+### New Configuration Set
+
+Introduce four new configuration keys:
+- `bucket_compact_check_interval`: Interval (ms) at which to check whether to start compaction for a region.
+- `bucket_compact_max_entries_sum`: Upper bound of the `num_entries` sum across all SST files in one compaction range. Default value is 1000000.
+- `bucket_compact_tombstone_percentage`: Only compact a range when `num_entries * tombstone_percentage / 100 <= num_deletion`. Default value is 30.
+- `bucket_compact_min_tombstones`: Minimum number of tombstones to trigger manual compaction. Default value is 10000.
+
+### Create Compactor For Each Table
+
+Create new compactor instances for each table, including `KEY_TABLE`, `DELETED_TABLE`, `DELETED_DIR_TABLE`, `DIRECTORY_TABLE`, and `FILE_TABLE`. Run these background workers using a scheduled executor with the configured interval and a random start time to spread out the workload.

Review Comment:
   We can also include the multipartInfoTable here; for anyone using multipart uploads, this table could also cross the numEntries and numDeletes thresholds.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
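As a rough sketch of the trigger logic the proposed configuration implies (class and method names here are illustrative, not from the PR): a compactor would sum `getNumEntries()` and `getNumDeletions()` from the `TableProperties` returned by `RocksDB#getPropertiesOfTablesInRange` for a candidate range, then apply the three thresholds before calling `compactRange`.

```java
// Hypothetical helper combining the three thresholds from the design doc:
// bucket_compact_max_entries_sum, bucket_compact_tombstone_percentage,
// and bucket_compact_min_tombstones. Only the threshold arithmetic is
// shown; wiring to RocksDB#getPropertiesOfTablesInRange is sketched in
// the Javadoc comment.
class CompactionThresholds {
    private final long maxEntriesSum;      // bucket_compact_max_entries_sum
    private final int tombstonePercentage; // bucket_compact_tombstone_percentage
    private final long minTombstones;      // bucket_compact_min_tombstones

    CompactionThresholds(long maxEntriesSum, int tombstonePercentage, long minTombstones) {
        this.maxEntriesSum = maxEntriesSum;
        this.tombstonePercentage = tombstonePercentage;
        this.minTombstones = minTombstones;
    }

    /**
     * Decide whether a key range should be manually compacted. The two sums
     * would be computed by adding up TableProperties#getNumEntries and
     * TableProperties#getNumDeletions over the map returned by
     * RocksDB#getPropertiesOfTablesInRange for that range.
     */
    boolean shouldCompact(long numEntriesSum, long numDeletionsSum) {
        return numEntriesSum <= maxEntriesSum          // range small enough to compact
            && numDeletionsSum >= minTombstones        // enough tombstones in absolute terms
            && numEntriesSum * tombstonePercentage / 100 <= numDeletionsSum; // tombstone ratio
    }

    public static void main(String[] args) {
        // Defaults from the design doc: 1000000 entries, 30%, 10000 tombstones.
        CompactionThresholds t = new CompactionThresholds(1_000_000L, 30, 10_000L);
        System.out.println(t.shouldCompact(100_000L, 40_000L)); // 40% tombstones: true
        System.out.println(t.shouldCompact(100_000L, 20_000L)); // below 30% ratio: false
        System.out.println(t.shouldCompact(20_000L, 9_000L));   // below min tombstones: false
    }
}
```

A range passing this check would then be compacted with `RocksDB#compactRange(ColumnFamilyHandle, byte[], byte[])` using the same start/end keys that were wrapped in `Range` slices for the statistics query.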
