errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110057316
########## hadoop-hdds/docs/content/design/degraded-storage-volumes.md ##########
@@ -0,0 +1,275 @@

---
title: Improved Storage Volume Handling for Ozone Datanodes
summary: Proposal to add a degraded storage volume health state in datanodes.
date: 2025-05-06
jira: HDDS-8387
status: draft
author: Ethan Rose, Rishabh Patel
---
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# Improved Storage Volume Handling for Ozone Datanodes

## Background

Currently Ozone uses two health states for storage volumes: **healthy** and **failed**. A volume scanner runs on each datanode to determine whether a volume should be moved from a **healthy** to a **failed** state. Once a volume is failed, all container replicas on that volume are removed from tracking by the datanode and considered lost. Volumes cannot return to a healthy state after failure without a datanode restart.

This model only works for hard failures, but in practice most volume failures are soft failures. Disk issues manifest in a variety of ways, and minor problems usually appear before a drive fails completely. The current approach to volume scanning and health classification does not account for this. If a volume is starting to exhibit signs of failure, the datanode only has two options:
- Fail the volume
  - In many cases the volume may still be mostly or partially readable. Containers on this volume that were still readable would be removed by the system and have their redundancy reduced unnecessarily. This is not a safe operation.
- Keep the volume healthy
  - Containers on this volume will not have extra copies made until the container scanner finds corruption and marks them unhealthy, at which point redundancy has already been lost.

For the common case of soft volume failures, neither of these is a good option. This document outlines a proposal to classify and handle soft volume failures in datanodes.

## Proposal

This document proposes adding a new volume state called **degraded**, which corresponds to partially failed volumes. Handling degraded volumes can be broken into two problems:
- **Identification**: Detecting degraded volumes and alerting via metrics and reports to SCM and Recon.
- **Remediation**: Proactively making copies of data on degraded volumes and preventing new writes before the volume completely fails.

This document is primarily focused on identification, and proposes handling remediation with a volume decommissioning feature that can be implemented independently of volume health state.

### Tools to Identify Volume Health State

Ozone has access to the following checks from the volume scanner to determine volume health. Most of these checks are already present.

#### Directory Check

This check verifies that a directory exists at the specified location for the volume, and that the datanode has read, write, and execute permissions on the directory.

#### Database Check

This check only applies to container data volumes (called `HddsVolumes` in the code). It checks that a new read handle can be acquired for the RocksDB instance on that volume, in addition to the write handle the process is currently holding. It does not use any RocksDB APIs that do individual SST file checksum validation, like paranoid checks. Corruption within an individual SST file only affects the keys in that file, and RocksDB verifies checksums for individual keys on each read. This keeps SST file checksum errors isolated to a per-container level; they will be detected by the container scanner, which marks the affected container unhealthy.
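As a rough illustration of such a check, the sketch below opens a second, read-only handle to a volume's RocksDB instance using the RocksDB Java API. The class and method names are hypothetical; this is not the PR's implementation:

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Hypothetical sketch of the database check; not the actual Ozone code.
public final class DatabaseCheckSketch {

  /**
   * Attempts to acquire a read-only handle to the volume's RocksDB
   * instance while the datanode's existing write handle remains open.
   * openReadOnly does not run paranoid per-SST-file checksum validation,
   * matching the behavior described above.
   */
  public static boolean canOpenReadHandle(String dbPath) {
    try (RocksDB db = RocksDB.openReadOnly(dbPath)) {
      return true;
    } catch (RocksDBException e) {
      // Per the proposal, a failure here would feed the failed health
      // sliding window described later in this document.
      return false;
    }
  }

  public static void main(String[] args) {
    RocksDB.loadLibrary();
    System.out.println(canOpenReadHandle(args[0]));
  }
}
```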
#### File Check

This check runs the following steps:
1. Generates a fixed amount of data and keeps it in memory
2. Writes the data to a file on the disk
3. Syncs the file to the disk to touch the hardware
4. Reads the file back to ensure the contents match what was in memory
5. Deletes the file

Of these, the file sync is the most important step, because it ensures that the disk is still reachable. This detects a dangerous condition where the disk is no longer present, but data remains readable and even writable (if sync is not used) due to in-memory caching by the OS and file system. The cached data may cease to be reachable at any time, and should not be counted as valid replicas of the data.
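A minimal sketch of these five steps, assuming a hypothetical `FileCheckSketch` class (illustration only, not the scanner's actual code):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.SyncFailedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of the file check; not the actual Ozone code.
public final class FileCheckSketch {

  public enum Result { HEALTHY, DEGRADED, FAILED }

  public static Result checkVolume(Path volumeDir) {
    byte[] expected = new byte[64 * 1024];        // 1. Generate data in memory.
    new Random().nextBytes(expected);
    Path probe = volumeDir.resolve(".volume-check");
    try {
      try (FileOutputStream out = new FileOutputStream(probe.toFile())) {
        out.write(expected);                      // 2. Write it to disk.
        try {
          out.getFD().sync();                     // 3. Sync to touch the hardware.
        } catch (SyncFailedException e) {
          return Result.FAILED;                   // Sync failure is the most serious outcome.
        }
      }
      byte[] actual = Files.readAllBytes(probe);  // 4. Read back and compare.
      return Arrays.equals(expected, actual) ? Result.HEALTHY : Result.DEGRADED;
    } catch (IOException e) {
      return Result.DEGRADED;                     // Any other IO failure.
    } finally {
      try {
        Files.deleteIfExists(probe);              // 5. Delete the probe file.
      } catch (IOException ignored) {
      }
    }
  }
}
```

The split between `FAILED` for sync errors and `DEGRADED` for other errors mirrors the sliding window assignments described below.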
#### IO Error Count

This would be a new check added as part of this feature. Currently, each time datanode IO encounters an error, we request an on-demand volume scan. This should include every time the container scanner marks a container unhealthy. We can keep a counter of how many IO errors have been reported on a volume over a given time frame, regardless of whether the corresponding volume scan passed or failed. This accounts for cases that show up on the main IO path but may otherwise not be detected by the volume scanner. For example, numerous old sectors holding existing container data may be unreadable. The volume scanner's **File Check** only utilizes new disk sectors, so it will still pass with these errors present, but the container scanner may be hitting many bad sectors across containers, which this check will account for.

#### Sliding Window

Most checks will encounter intermittent issues, even on overall healthy drives, so we should not downgrade volume health state after just one error. The current volume scanner uses a counter-based sliding window for intermittent failures, meaning the volume will be failed if `x` out of the last `y` checks failed, regardless of when they occurred. This approach works for background volume scans, because `y` is the number of times the check ran and `x` is the number of times it failed. It does not work if we want to apply a sliding window to on-demand checks like the IO error count, which do not care whether the corresponding volume scan passed or failed.

To handle this, we can switch to time-based sliding windows to determine when a threshold of tolerable errors is crossed. For example, if a check has failed `x` times in the last `y` minutes, we should consider the volume degraded.

We can use one time-based sliding window to track errors that would cause a volume to be degraded, and a second one for errors that would cause a volume to be failed. When a check fails, it can add the result to whichever sliding window it corresponds to. We can create the following assignments of checks:

- **Directory Check**: No sliding window required. If the volume is not present based on filesystem metadata, it should be failed immediately.
- **Database Check**: On failure, add an entry to the **failed health sliding window**.
- **File Check**:
  - If the sync portion of the check fails, add an entry to the **failed health sliding window**.
  - If any other part of the check fails, add an entry to the **degraded health sliding window**.
- **IO Error Count**: When an on-demand volume scan is requested, add an entry to the **degraded health sliding window**.

Review Comment:

> This is how the container scanner informs the volume scanner of a problem, correct?

We currently have code that triggers on-demand volume scans in the `catch` block of most IO operations. It is currently missing when the container scanner marks a container unhealthy, but we should add it there since that is also an IO error.

> Why do we need a sliding window? If a total of X I/O errors was reported, decide to fail it. A sliding window to me makes a decision to prioritize errors based on time, but that is complicated to implement; instead, a threshold is a simple measure.

There is still a time-based component in this suggestion: datanode uptime. A very long-running datanode will eventually hit X even on a healthy volume. Fixing the time span with a sliding window normalizes for this.

> Even in this case, how do you decide what is X? What heuristic guides this decision?

This is a tricky problem, and I'm not sure I have a good heuristic right now. But we should note it is not unique to this proposal. Even the current volume scanner uses a counter-based sliding window, where 2 of the last 3 checks must have failed in order to fail a volume. The only other option is to fail a volume on a single IO error, which would be too aggressive. Even a healthy disk is going to have some IO bumps occasionally.
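To make the time-based window discussed above concrete, here is a minimal sketch of one possible implementation. The class name and API are hypothetical, not the PR's code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a time-based sliding window; not the PR's code.
public final class TimeBasedSlidingWindow {
  private final int failureThreshold;
  private final Duration window;
  private final Deque<Instant> failures = new ArrayDeque<>();

  public TimeBasedSlidingWindow(int failureThreshold, Duration window) {
    this.failureThreshold = failureThreshold;
    this.window = window;
  }

  /**
   * Records one failure and reports whether the threshold is crossed:
   * at least failureThreshold failures within the trailing time window.
   */
  public synchronized boolean recordFailure(Instant now) {
    failures.addLast(now);
    // Evict entries that have aged out of the trailing time window.
    Instant cutoff = now.minus(window);
    while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
      failures.removeFirst();
    }
    return failures.size() >= failureThreshold;
  }
}
```

A datanode could keep two such windows per volume, one gating the transition to **degraded** and one gating the transition to **failed**, matching the check assignments in the proposal.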
