ptlrs commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2103702273

##########
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##########
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and **failed**. A volume scanner runs on each datanode to determine whether a volume should be moved from a **healthy** to a **failed** state. Once a volume is failed, all container replicas on that volume are removed from tracking by the datanode and considered lost. Volumes cannot return to a healthy state after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most volume failures are soft failures. Disk issues manifest in a variety of ways and minor problems usually appear before a drive fails completely. The current approach to volume scanning and health classification does not account for this. If a volume is starting to exhibit signs of failure, the datanode only has two options:
+- Fail the volume
+  - In many cases the volume may still be mostly or partially readable. Containers on this volume that were still readable would be removed by the system and have their redundancy reduced unnecessarily. This is not a safe operation.
+- Keep the volume healthy
+  - Containers on this volume will not have extra copies made until the container scanner finds corruption and marks them unhealthy, after which we have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good options. This document outlines a proposal to classify and handle soft volume failures in datanodes.

Review Comment:
When the first on-demand container scan is triggered, we could speed up detection of the degraded/failed state of a disk by throttling up the background volume scanner. This would reduce the time required to satisfy the sliding window criteria, at the expense of extra IO that competes with operational reads. Since the durability of data is a priority, that trade-off seems worthwhile.

One of the points discussed was that the replication manager changes required for acting on a degraded volume would align with the changes required for a volume-decommissioning feature. As a result, this proposal suggests taking on the replication manager changes as the next step.

An alternative would be to first implement a simplified detection of the degraded state and update the existing replication manager to take the new degraded volume state into account when replicating. Improved detection of the degraded state and decommissioning of volumes could then be done at a later stage.

What do you think @errose28?
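
To make the first suggestion concrete, here is a rough sketch of how an on-demand container scan could temporarily shorten the volume scan interval so that a sliding window of failures fills faster. This is only an illustration under assumed semantics: the class, method names, intervals, and threshold are all hypothetical and do not correspond to existing Ozone code or to the criteria defined in the design doc.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only; none of these names exist in Ozone. */
class AdaptiveVolumeScanSketch {
  private final Duration normalInterval;   // assumed regular scan gap
  private final Duration boostedInterval;  // assumed shorter gap while boosted
  private final Duration window;           // assumed sliding window length
  private final int failureThreshold;      // assumed failures needed in window
  private final Deque<Long> failureTimesMs = new ArrayDeque<>();
  private long boostedUntilMs = Long.MIN_VALUE;

  AdaptiveVolumeScanSketch(Duration normalInterval, Duration boostedInterval,
      Duration window, int failureThreshold) {
    this.normalInterval = normalInterval;
    this.boostedInterval = boostedInterval;
    this.window = window;
    this.failureThreshold = failureThreshold;
  }

  /** An on-demand container scan boosts volume scanning for one window. */
  void onDemandContainerScanTriggered(long nowMs) {
    boostedUntilMs = nowMs + window.toMillis();
  }

  /** Interval to wait before the next background volume scan. */
  Duration nextScanInterval(long nowMs) {
    return nowMs < boostedUntilMs ? boostedInterval : normalInterval;
  }

  /** Records the result of one volume scan. */
  void recordScan(long nowMs, boolean passed) {
    if (!passed) {
      failureTimesMs.addLast(nowMs);
    }
    evictOld(nowMs);
  }

  /** True when enough failures fall inside the sliding window. */
  boolean isDegraded(long nowMs) {
    evictOld(nowMs);
    return failureTimesMs.size() >= failureThreshold;
  }

  // Drops failures that are older than the sliding window.
  private void evictOld(long nowMs) {
    long cutoff = nowMs - window.toMillis();
    while (!failureTimesMs.isEmpty() && failureTimesMs.peekFirst() < cutoff) {
      failureTimesMs.removeFirst();
    }
  }

  public static void main(String[] args) {
    AdaptiveVolumeScanSketch s = new AdaptiveVolumeScanSketch(
        Duration.ofMinutes(60), Duration.ofMinutes(5),
        Duration.ofMinutes(30), 3);
    s.onDemandContainerScanTriggered(0L);
    System.out.println("Next scan in: " + s.nextScanInterval(0L));
    for (long t = 0; t < 3; t++) {
      s.recordScan(Duration.ofMinutes(5 * t).toMillis(), false);
    }
    System.out.println("Degraded: "
        + s.isDegraded(Duration.ofMinutes(10).toMillis()));
  }
}
```

With the assumed 60-minute normal interval, three failed scans could never fall inside a 30-minute window; the boosted 5-minute interval is what lets the window criteria be met at all in this toy setup. The real intervals and thresholds would of course come from the design doc and configuration.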
