sollhui opened a new issue, #60616: URL: https://github.com/apache/doris/issues/60616
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Description

### Background

Currently, the number of memtable flush threads in Doris must be tuned manually by users/SRE/R&D staff for each business load scenario. This manual tuning carries a heavy operational burden, and an improper setting can cause resource contention (CPU/IO bottlenecks), excessive small-file generation, or memtable memory-overflow risk, seriously affecting the stability and performance of the Doris cluster.

Implement an **adaptive Memtable flush thread pool adjustment mechanism** that dynamically calculates and applies the maximum concurrent flush thread count based on real-time cluster load metrics. The core design is optimized for both of Doris's deployment architectures: **storage-compute integrated** and **storage-compute separated**.

### Core Design Details

#### 1. Real-time Load Metric Collection

Collect multi-dimensional metrics periodically to reflect the current system state (all metrics are collected atomically and are thread-safe):

- Total memtable memory usage (monitored against the soft/hard memory limit thresholds)
- Memtable flush task queue backlog size
- Disk IO busy status (implementation differs by deployment architecture):
  - Storage-compute integrated: judged by the disk IO util metric
  - Storage-compute separated (S3/HDFS): judged by the queue length of the S3 write thread pool
- CPU usage of the BE node (to avoid the context-switch overhead caused by excessive flush threads)

#### 2. Adaptive Flush Thread Count Calculation

Add a dedicated calculation routine (executed every 1 minute) that dynamically adjusts the base concurrent flush thread count, clamped to **upper/lower limits** to avoid extreme values.
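Before the calculation runs, the metrics from section 1 must be readable without locking. A minimal sketch of such an atomic snapshot follows; all type and field names here are hypothetical, not Doris's actual code:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch only: one lock-free snapshot of the load metrics
// listed in section 1. The collector thread stores into these atomics;
// the once-per-minute tuning routine loads them without taking a lock.
struct FlushLoadMetrics {
    std::atomic<int64_t> memtable_mem_bytes{0};    // total memtable memory usage
    std::atomic<int64_t> flush_queue_size{0};      // flush task queue backlog
    std::atomic<int>     disk_io_util_pct{0};      // integrated mode: disk IO util
    std::atomic<int64_t> s3_upload_queue_size{0};  // separated mode: S3 writer queue
    std::atomic<int>     cpu_usage_pct{0};         // BE node CPU usage
};
```

Since each metric is an independent gauge, relaxed memory ordering would also suffice; the sketch keeps the default sequentially consistent operations for simplicity.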
The core judgment rules are:

```cpp
int CalculateMaxConcurrentFlush() {
    // Condition 1: memory reaches the soft limit -> +1
    if (_memory_limiter != nullptr && _memory_limiter->mem_usage() > 0) {
        base_concurrent = std::min(max_threads, base_concurrent + 1);
    }

    // Condition 2: flush queue backlog > 10 -> +1
    int queue_size = _flush_pool->get_queue_size();
    if (queue_size > kFlushQueueThreshold) {
        base_concurrent = std::min(max_threads, base_concurrent + 1);
    }

    // Condition 3: IO busy -> -1
    //   compute-storage integrated: disk IO util > 90%
    //   compute-storage separated (cloud): S3 upload queue > threshold
    if (_is_io_busy()) {
        base_concurrent = std::max(min_threads, base_concurrent - 1);
    }

    // Condition 4: CPU usage > 90% -> -1
    if (_is_cpu_busy()) {
        base_concurrent = std::max(min_threads, base_concurrent - 1);
    }
    ...
}
```

#### 3. Flush Memtable Thread Count

In the storage-compute separated deployment mode there is no direct disk IO interaction, so disk metrics are no longer a consideration. We therefore unify the calculation of the thread count limits on the number of CPU cores for both **storage-compute integrated** and **storage-compute separated** deployments:

- Minimum thread count: `num_cpus * config::min_flush_thread_num_per_cpu` (default: 1/2 per CPU)
- Maximum thread count: `num_cpus * config::max_flush_thread_num_per_cpu` (default: 4 per CPU)

#### 4. Bad Case and Mitigation

- In mixed scenarios (e.g., large query + data import), sustained high CPU/IO usage keeps decrementing the flush thread count until it bottoms out at a low value. Set a **reasonable minimum thread count** to stop the continuous reduction and prevent flush task backlog.
- Conversely, a minimum set higher than the smallest allowed value may leave much of the flush thread pool idle when the write load is low.
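The per-CPU bounds from section 3, which also provide the minimum floor that section 4's mitigation relies on, could be computed roughly as below. The function and struct names are illustrative; only the 1/2 and 4 threads-per-CPU defaults come from the proposal:

```cpp
#include <algorithm>

// Illustrative sketch: derive the flush thread pool's lower and upper
// bounds from the CPU core count, mirroring the proposed
// config::min_flush_thread_num_per_cpu / config::max_flush_thread_num_per_cpu.
struct FlushThreadBounds {
    int min_threads;
    int max_threads;
};

FlushThreadBounds compute_flush_thread_bounds(int num_cpus,
                                              double min_per_cpu = 0.5,
                                              double max_per_cpu = 4.0) {
    // Never drop below one thread, even on tiny machines.
    int min_threads = std::max(1, static_cast<int>(num_cpus * min_per_cpu));
    int max_threads = std::max(min_threads,
                               static_cast<int>(num_cpus * max_per_cpu));
    return {min_threads, max_threads};
}
```

With the defaults, a 16-core BE would clamp the adaptive value to the range [8, 64].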
#### 5. Extensible Class Design

Introduce a new generic adaptive-configuration class to manage the flush thread pool parameters; it can later be extended to adaptively tune other write-module parameters in Doris (ensuring code reusability).

### Use case

_No response_

### Related issues

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
