Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-11-18 Thread via GitHub


github-actions[bot] closed pull request #8405: HDDS-8387. Improved Storage 
Volume Handling in Datanodes
URL: https://github.com/apache/ozone/pull/8405


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-11-18 Thread via GitHub


github-actions[bot] commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-3549966391

   Thank you for your contribution. This PR is being closed due to inactivity. 
If needed, feel free to reopen it.




Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-11-11 Thread via GitHub


github-actions[bot] commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-3519267943

   This PR has been marked as stale due to 21 days of inactivity. Please 
comment or remove the stale label to keep it open. Otherwise, it will be 
automatically closed in 7 days.




Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-07-16 Thread via GitHub


errose28 commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-3080962638

   Sorry for the delay. What we are currently working on in HDDS-8387 and 
HDDS-13094 is increased observability around volume failures. These are mostly 
straightforward items mentioned in this doc that don't change how failed 
volumes are identified. After these gaps are filled in, we can begin looking at 
degraded volumes again. 




Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-06-25 Thread via GitHub


siddhantsangwan commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-3006938506

   Are we going to proceed with this, or is this paused for now? @errose28 




Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-29 Thread via GitHub


slfan1989 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2114813429


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+  - In many cases the volume may still be mostly or partially readable. Containers on this volume that were still readable would be removed by the system and have their redundancy reduced unnecessarily. This is not a safe operation.
+- Keep the volume healthy
+  - Containers on this volume will not have extra copies made until the container scanner finds corruption and marks them unhealthy, after which we have already lost redundancy.
+
+For the common case of soft volume failures, neither of these is a good option. This document outlines a proposal to classify and handle soft volume
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
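As an illustration, the directory check described above amounts to an existence test plus a permission test. A minimal Python sketch (the function name is illustrative, not from the Ozone code):

```python
import os

def check_volume_directory(path: str) -> bool:
    """Sketch of the directory check: the volume root must exist as a
    directory and be readable, writable, and executable (traversable)
    by the datanode process."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)
```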
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
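The shape of this check can be sketched as follows, using SQLite (stdlib) as a stand-in for the per-volume RocksDB instance, since the idea is the same: while the process keeps its long-lived write handle, attempt to acquire an independent read-only handle and touch it. The function name is illustrative.

```python
import sqlite3

def db_read_handle_ok(db_path: str) -> bool:
    """Sketch of the database check, with SQLite standing in for RocksDB:
    try to open a new read-only handle on the volume's database, in
    addition to the write handle the process already holds."""
    try:
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        conn.execute("SELECT 1").fetchone()  # exercise the new handle
        conn.close()
        return True
    except sqlite3.Error:
        return False
```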
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
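The five steps above, with the crucial fsync in step 3, can be sketched like this in Python (scratch-file name and sizes are illustrative, not Ozone's):

```python
import os

def file_check(volume_path: str, size: int = 64 * 1024) -> bool:
    """Sketch of the file check: generate data in memory, write it to a
    scratch file on the volume, fsync so the write must reach the
    hardware, read it back, compare, then delete the file."""
    data = os.urandom(size)                               # 1. data kept in memory
    probe = os.path.join(volume_path, ".disk-check.tmp")  # hypothetical scratch name
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        try:
            os.write(fd, data)                            # 2. write to the disk
            os.fsync(fd)                                  # 3. sync to touch hardware
        finally:
            os.close(fd)
        with open(probe, "rb") as f:                      # 4. read back and compare
            ok = f.read() == data
    except OSError:
        return False
    finally:
        try:
            os.remove(probe)                              # 5. delete the file
        except OSError:
            pass
    return ok
```

Without the `os.fsync` call, a write to a vanished disk could still "succeed" against the OS page cache, which is exactly the dangerous condition the paragraph above describes.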
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.
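A per-volume counter over a sliding time window, as proposed here, might look like the following sketch (class and method names are illustrative, not from the Ozone code):

```python
import time
from collections import deque

class VolumeIOErrorCounter:
    """Sketch of the proposed IO error count: record every IO error
    reported against a volume and report how many fell inside a sliding
    time window, regardless of whether the on-demand scan they
    triggered passed or failed."""

    def __init__(self, window_seconds: float, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock      # injectable for testing
        self.errors = deque()   # timestamps of reported errors

    def record_error(self):
        self.errors.append(self.clock())

    def errors_in_window(self) -> int:
        cutoff = self.clock() - self.window
        while self.errors and self.errors[0] < cutoff:
            self.errors.popleft()  # drop errors older than the window
        return len(self.errors)
```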

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-29 Thread via GitHub


slfan1989 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2114806961



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-29 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2114436190



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110392092



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110088957



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110085369


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
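+A minimal version of this check can be expressed with `java.nio.file` (a sketch under assumed names, not the actual scanner code):

```java
import java.nio.file.Files;
import java.nio.file.Path;

class DirectoryCheck {
  /**
   * Returns true if the volume root exists, is a directory, and the
   * datanode process has read, write, and execute permission on it.
   */
  static boolean check(Path volumeRoot) {
    return Files.isDirectory(volumeRoot)
        && Files.isReadable(volumeRoot)
        && Files.isWritable(volumeRoot)
        && Files.isExecutable(volumeRoot);
  }
}
```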
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This isolates SST file checksum errors to a 
per-container level; the container scanner will detect them and mark the 
affected container unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
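+The five steps above can be sketched with `java.nio` as follows (illustrative only; buffer size, file name, and error handling are assumptions, and the real scanner would add throttling and error accounting):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Random;

class FileCheck {
  /**
   * Round-trips a buffer of random data through a scratch file on the
   * volume and returns true only if the read-back contents match.
   */
  static boolean check(Path volumeRoot, int size) {
    byte[] expected = new byte[size];
    new Random().nextBytes(expected);            // 1. generate data in memory
    Path scratch = volumeRoot.resolve("volume-check.tmp");
    try (FileChannel ch = FileChannel.open(scratch,
        StandardOpenOption.CREATE, StandardOpenOption.READ,
        StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(expected));       // 2. write the data to a file
      ch.force(true);                            // 3. sync to touch the hardware
      ByteBuffer readBack = ByteBuffer.allocate(size);
      ch.read(readBack, 0);                      // 4. read the file back
      return Arrays.equals(expected, readBack.array());
    } catch (IOException e) {
      return false;                              // any IO error fails the check
    } finally {
      try {
        Files.deleteIfExists(scratch);           // 5. delete the file
      } catch (IOException ignored) {
      }
    }
  }
}
```

Note that `force(true)` is the step that defeats the OS page cache described above: a write and read alone can both succeed against cached pages even after the disk has disappeared.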
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scan.
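+One way to implement such a counter is a sliding time window (a sketch; the class name, window length, and single-threaded structure are assumptions, and a real implementation would need to be thread safe):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window counter for IO errors reported against one volume. */
class VolumeIoErrorTracker {
  private final Deque<Long> errorTimestamps = new ArrayDeque<>();
  private final long windowMillis;

  VolumeIoErrorTracker(long windowMillis) {
    this.windowMillis = windowMillis;
  }

  /** Record one IO error, regardless of whether the volume scan passed. */
  void recordError(long nowMillis) {
    errorTimestamps.addLast(nowMillis);
  }

  /** Number of errors seen within the window ending at nowMillis. */
  int errorsInWindow(long nowMillis) {
    while (!errorTimestamps.isEmpty()
        && errorTimestamps.peekFirst() < nowMillis - windowMillis) {
      errorTimestamps.removeFirst();  // expire entries outside the window
    }
    return errorTimestamps.size();
  }
}
```

The count rising above some threshold could then contribute to a degraded classification even when every triggered volume scan passes.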

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110076247


hadoop-hdds/docs/content/design/degraded-storage-volumes.md:

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110076247


hadoop-hdds/docs/content/design/degraded-storage-volumes.md:

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110070742


hadoop-hdds/docs/content/design/degraded-storage-volumes.md:

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110057316


hadoop-hdds/docs/content/design/degraded-storage-volumes.md:

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110034239


hadoop-hdds/docs/content/design/degraded-storage-volumes.md:

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110031200


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+ Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+ Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+ File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+ IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.
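A minimal version of such a counter could be a time-based sliding window (hypothetical sketch; the window length, threshold, and class name are illustrative assumptions, and the actual proposal may count scans rather than wall-clock time):

```python
import time
from collections import deque
from typing import Optional

class VolumeIOErrorWindow:
    """Counts IO errors reported against a volume within a sliding
    time window, regardless of whether the triggered scan passed."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: deque = deque()  # timestamps of reported errors

    def record_error(self, now: Optional[float] = None) -> None:
        # Called each time an on-demand volume scan is requested.
        self._events.append(time.time() if now is None else now)

    def is_degraded(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Evict events that have aged out of the window before counting.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

For example, three errors within a 60-second window would flag the volume even if every individual volume scan passed.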

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-27 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110024650


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.

Review Comment:
   > So here is a situation: I hit a bad sector, and an IO error is reported, 
which triggers an on-demand scan: the value of X is incremented. Now, in the 
current behavior, RM replicates the good replicas from other sources 
immediately. So, full durability is restored by the system.
   > With the proposed model, I have compromised durability because until my 
window length of (x-y) is hit, my container has only 2 good copies elsewhere. 
   
   This would still happen in the proposed model. There are no proposed changes 
to replication manager or container states in this document. I think there is 
some confusion between the on-demand container scanner and on-demand volume 
scanners here as well. On-demand container scanner will be triggered when a bad 
sector is read within the container, and if that fails it will mark the 
container unhealthy triggering the normal replication process. There is no 
sliding window for the on-demand container scanner.
   
   What is proposed in this doc is that if the on-demand container scanner 
marks a container unhealthy, it should also trigger an on-demand volume scan. 
For each on-demand volume scan requested, it would add a counter towards the 
degraded state sliding window of that volume.
   
   > Instead, a more desirable situation is if X = 1, degraded volume has the 
last copy of the container, RM replicated from this as the source, rest of the 
behavior is left identical.
   
   If there is only one copy of a container then it is already under-replicated 
and RM will copy from this volume as long as it is not failed. This doc does 
not propose any changes here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-25 Thread via GitHub


sumitagrawl commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2106222096


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-23 Thread via GitHub


slfan1989 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2105518743


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-23 Thread via GitHub


slfan1989 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2105515167


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-22 Thread via GitHub


ptlrs commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2103702273


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.

Review Comment:
   When the first on-demand container scan is triggered, we could speed up the 
degraded/failed state detection of a disk by throttling up the background 
volume scanner. This would reduce the time required to satisfy the sliding 
window criteria at the expense of operational reads and increased IO.  
   
   The durability of data is a priority. One of the points discussed was that 
the replication manager changes required for acting upon a degraded volume 
would align with the changes required for a volume-decommissioning feature. As 
a result, this proposal suggests taking on the replication manager changes as 
the next step. 
   
   An alternative would be to first have a simplified detection of the degraded 
state and improve the existing replication manager's actions to consider the 
new degraded volume state when replicating. Improving the detection of degraded 
state and decommissioning of volumes could be done at a later stage. What do 
you think @errose28?






Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-22 Thread via GitHub


swagle commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2103619863


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.

Review Comment:
   So here is a situation: I hit a bad sector, and an IO error is reported, 
which triggers an on-demand scan: the value of X is incremented. Now, in the 
current behavior, RM replicates the good replicas from other sources 
immediately. So, full durability is restored by the system.
   With the proposed model, I have compromised durability because until my 
window length of (x-y) is hit, my container has only 2 good copies elsewhere. 
Instead, a more desirable situation is if X = 1, degraded volume has the last 
copy of the container, RM replicated from this as the source, rest of the 
behavior is left identical. That increases the overall durability of the system 
even more than what is available today. 






Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-22 Thread via GitHub


swagle commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2103550242


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-22 Thread via GitHub


swagle commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2103550242


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
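In Python terms, the equivalent of this check is a short predicate over `os.access` (a sketch only; the datanode performs this check in Java):

```python
import os

def directory_check(volume_dir):
    """The volume root must exist as a directory, and the process must hold
    read, write, and execute permission on it."""
    return (os.path.isdir(volume_dir)
            and os.access(volume_dir, os.R_OK | os.W_OK | os.X_OK))
```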
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that perform individual SST 
file checksum validation, such as paranoid checks. Corruption within individual 
SST files will only affect the keys in those files, and RocksDB verifies 
checksums for individual keys on each read. This isolates SST file checksum 
errors to the per-container level; they will be detected by the container 
scanner, which will mark the affected container unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.


Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-21 Thread via GitHub


ptlrs commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2101580864


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that perform individual SST 
file checksum validation, such as paranoid checks. Corruption within individual 
SST files will only affect the keys in those files, and RocksDB verifies 
checksums for individual keys on each read. This isolates SST file checksum 
errors to the per-container level; they will be detected by the container 
scanner, which will mark the affected container unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-21 Thread via GitHub


ptlrs commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2101580678


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that perform individual SST 
file checksum validation, such as paranoid checks. Corruption within individual 
SST files will only affect the keys in those files, and RocksDB verifies 
checksums for individual keys on each read. This isolates SST file checksum 
errors to the per-container level; they will be detected by the container 
scanner, which will mark the affected container unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-20 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2098864249


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.

Review Comment:
   Yes, there is a mapping of which checks correspond to which sliding window, 
defined in the **Sliding Window** section; when the threshold of the window is 
crossed, the state is changed. Defining the specific thresholds for the windows 
is going to take some thought, so for now I've left that detail to one of the 
tasks in the **Task Breakdown** section. If we are able to decide on this 
earlier, we can specify the initial recommendation in the doc as well.
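For illustration, a per-check sliding window of results with a failure threshold that flips the volume to a degraded state might look like the following Python sketch. The window size and threshold here are placeholders, since the doc leaves the real values to a later task:

```python
from collections import deque

class SlidingWindow:
    """Tracks the last `size` results of one check type. The volume is treated
    as degraded once failures in the window reach `threshold` (both values are
    illustrative placeholders, not the design's final numbers)."""

    def __init__(self, size=10, threshold=3):
        self.results = deque(maxlen=size)   # True = check passed, False = failed
        self.threshold = threshold

    def record(self, passed):
        """Append one check result; old results age out automatically."""
        self.results.append(passed)

    def degraded(self):
        """True once enough recent failures have accumulated in the window."""
        return sum(1 for r in self.results if not r) >= self.threshold
```

Because the deque is bounded, a run of passing scans eventually pushes old failures out of the window, letting a recovered volume return to a healthy assessment without a restart.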



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-20 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2098822714


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Identification of Degraded Volumes
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that perform individual SST 
file checksum validation, such as paranoid checks. Corruption within individual 
SST files will only affect the keys in those files, and RocksDB verifies 
checksums for individual keys on each read. This isolates SST file checksum 
errors to the per-container level; they will be detected by the container 
scanner, which will mark the affected container unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-20 Thread via GitHub


Copilot commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2098585829


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unecessarily. This is not a safe 
operation.

Review Comment:
   The word 'unecessarily' is misspelled; consider replacing it with 
'unnecessarily'.
   ```suggestion
   - In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
   ```



##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
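As a sketch, the directory check described above amounts to a few `java.io.File` permission queries. The class and method names below are illustrative, not Ozone's actual implementation:

```java
import java.io.File;

// Sketch of the directory check described above. Names are
// illustrative, not Ozone's actual implementation.
public class DirectoryCheck {

    // A volume directory passes if it exists as a directory and the
    // datanode process has read, write, and execute permission on it.
    public static boolean checkVolumeDir(File dir) {
        return dir.isDirectory()
            && dir.canRead()
            && dir.canWrite()
            && dir.canExecute();
    }

    public static void main(String[] args) {
        // The JVM temp directory normally satisfies all three permissions.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(checkVolumeDir(tmp));
    }
}
```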
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-20 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2097079285


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Review Comment:
   I found these two. Did I miss anything else?
   
   ```
   Failure threshold of the **degraded volume sliding window** is crossed.
   Failure threshold of the **failed volume sliding window** is crossed.
   ```
   What are the recommended (default) threshold values for the degraded and failed states, and what is the default sliding window duration? An explanation of why these defaults were chosen would also be helpful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2097079285


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Review Comment:
   ```
   Failure threshold of the **degraded volume sliding window** is crossed.
   Failure threshold of the **failed volume sliding window** is crossed.
   ```
   What are the recommended (default) threshold values for the degraded and failed states, and what is the default sliding window duration? An explanation of why these defaults were chosen would also be helpful.






Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2096996278


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+
+### Identification of Degraded Volumes
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the code). It checks that a new read handle can be acquired for the RocksDB instance on that volume, in addition to the write handle the process is currently holding. It does not use any RocksDB APIs that do individual SST file checksum validation, such as paranoid checks. Corruption within individual SST files will only affect the keys in those files, and RocksDB verifies checksums for individual keys on each read. This makes SST file checksum errors isolated to a per-container level; they will be detected by the container scanner and cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
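The five steps above can be sketched with `java.nio`. The `runFileCheck` method and the probe file name are assumptions for illustration, not Ozone's actual code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of the five-step file check; names are
// assumptions, not Ozone's implementation.
public class FileCheckSketch {

    public static boolean runFileCheck(Path volumeDir) throws IOException {
        byte[] data = new byte[4096];                  // 1. fixed data, kept in memory
        new Random().nextBytes(data);
        Path probe = volumeDir.resolve("volume-probe.tmp");
        try (FileChannel ch = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));           // 2. write the data to a file
            ch.force(true);                            // 3. fsync: touch the hardware
            ByteBuffer back = ByteBuffer.allocate(data.length);
            while (back.hasRemaining()) {              // 4. read it back and compare
                if (ch.read(back, back.position()) < 0) {
                    break;
                }
            }
            return Arrays.equals(data, back.array());
        } finally {
            Files.deleteIfExists(probe);               // 5. delete the probe file
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("volcheck");
        System.out.println(runFileCheck(tmp));
        Files.delete(tmp);
    }
}
```

The `force(true)` call is the step that defeats OS and file system caching: a missing disk can still serve cached reads and buffer writes, but an fsync must reach the hardware.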
+
+#### IO Error Count
+
+This would be a new check added as part of this feature. Currently, each time datanode IO encounters an error, we request an on-demand volume scan. This should include every time the container scanner marks a container unhealthy. We can keep a counter of how many IO errors have been reported on a volume over a given time frame, regardless of whether the corresponding volume scan passed or failed. This accounts for cases that may show up on the main IO path but might otherwise not be detected by the volume scanner.
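A minimal sketch of such a per-volume error counter over a time window. Class names, the window length, and the API shape are illustrative only, with no claim to match Ozone's eventual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window IO error counter: record each reported
// IO error, then ask how many fell within the last windowMillis.
public class IoErrorWindow {

    private final Deque<Long> errorTimes = new ArrayDeque<>();
    private final long windowMillis;

    public IoErrorWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
    }

    public synchronized int errorsInWindow(long nowMillis) {
        // Evict entries older than the window before counting.
        while (!errorTimes.isEmpty()
                && errorTimes.peekFirst() < nowMillis - windowMillis) {
            errorTimes.removeFirst();
        }
        return errorTimes.size();
    }

    public static void main(String[] args) {
        IoErrorWindow w = new IoErrorWindow(60_000);   // 60s window, illustrative
        long now = System.currentTimeMillis();
        w.recordError(now);
        System.out.println(w.errorsInWindow(now));     // prints 1
    }
}
```

A degraded-volume policy could then compare `errorsInWindow` against a configurable threshold, independently of whether each triggered volume scan passed.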

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2096597589


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2096585205


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2096583889


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Review Comment:
   Specifications are provided later in the doc. Do you still have questions 
after finishing the document?






Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2095411078


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@

Review Comment:
   Can you explain a bit more about what kinds of failures are categorized as hard failures, and which are treated as soft failures? Some examples would help clarify the goal of this proposal.






Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2095401002


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+- In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+- Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these are good 
options. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Identification of Degraded Volumes
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+ Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+ Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+ File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+ IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.
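One way to realize such a counter is a per-volume sliding-window tracker. The sketch below is a minimal illustration; the class name and threshold semantics are invented here, not part of Ozone:

```python
import time
from collections import deque

class VolumeIoErrorTracker:
    """Count IO errors reported against one volume inside a sliding time
    window, regardless of whether the follow-up volume scan passed."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window = window_seconds
        self.threshold = threshold
        self._errors = deque()  # timestamps of reported IO errors

    def record_error(self, now: float = None) -> None:
        self._errors.append(time.time() if now is None else now)

    def error_count(self, now: float = None) -> int:
        now = time.time() if now is None else now
        # evict errors that have aged out of the window
        while self._errors and self._errors[0] < now - self.window:
            self._errors.popleft()
        return len(self._errors)

    def is_degraded(self, now: float = None) -> bool:
        return self.error_count(now) >= self.threshold
```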

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-19 Thread via GitHub


ChenSammi commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2095396822


##
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+  - In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system and have their redundancy reduced unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+  - Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these is a good 
option. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Identification of Degraded Volumes
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This makes SST file checksum errors isolated 
to a per-container level and they will be detected by the container scanner and 
cause the container to be marked unhealthy.
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that may show up on the main IO 
path but may otherwise not be detected by the volume scanner.

Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-08 Thread via GitHub


errose28 commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-2863715898

   @sodonnel based on your comments I have another proposal to handle this 
issue. I can write that up in this doc as well so we can compare.
   
   The current proposal mixes a degraded volume state with a sort of volume 
decommissioning feature. The latter is where most of the complexity comes from. 
As an initial change, we can make the degraded state purely a sort of alert 
that shows up via metrics, CLI, Recon, etc when a volume is experiencing 
numerous IO errors but is still reachable. The state does not need to be 
persisted in this case. At a later time, we can add volume decommissioning as a 
separate feature, which would handle persistence of the decom state, space 
calculation, moving data, and all that work similar to full datanode 
decommissioning. We could optionally add a config to have the system 
automatically decom degraded volumes. However, in this proposal volume 
decommissioning would be left as a future improvement, and the current scope of 
work would just be about flagging a degraded state for volumes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-07 Thread via GitHub


errose28 commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-2860628521

   Thanks for checking this out @sodonnel. I can improve the motivation at the 
top of this doc, but the driving factor is the same as any changes we have made 
to replication manager, reconstruction, or reconciliation: As a storage system, 
we must prioritize data durability over everything else, and we should never 
deliberately reduce data durability.
   
   > My observation from past problems on HDFS is that partially failed disks 
are a very large problem. They are hard to detect and sometimes reads on them 
can block for a very long time, resulting in hard to explain slow reads. I'd be 
more in favor of failing bad volumes completely,
   
   This is conflating two different issues with partially failed volumes: 
performance and durability. This doc is only concerned with data durability, 
which is more important. If a disk is causing performance problems then that 
should be identified with metrics and alerting, which we also don't do well, but 
that would be a different proposal. We should not remove readable replicas 
without first copying them just to improve system performance.
   
   > The system is intended to handle the abrupt loss of a datanode or disk at 
any time, so what is driving the need for this proposal? Are volumes being 
failed too easily resulting in dataloss?
   
   There is a difference between us losing copies of data because of an 
external issue we are responding to, and us losing copies of data because we 
removed them ourselves. In the latter case we are in control, and need to make 
new copies before removing existing ones. For reference, previously our 
handling of unhealthy replicas did not do this (we deleted them on sight) and 
this was rightfully changed.
   
   > If volumes are being failed too eagerly, then for what reason? Disk full, 
checksum errors, outright failed reads?
   
   This seems to imply that there is an exact set of criteria to fail a volume, 
and anything outside of that is either "too eager" or "not eager enough". Disk 
failures are a fuzzy problem and I don't think such an exact set of criteria 
exists. The purpose of adding an intermediate state is to safely account for 
this unknown, rather than pin down a binary definition of volume health which 
becomes closely tied to our durability guarantees.
   
   > We do have mechanisms to repair bad containers already (scanner and 
reconcilor), so that part is handled.
   
   This is true. An alternate proposal would be to keep the current criteria we 
are using for volume failure, and discard all checks that this doc currently 
proposes using to move a volume to degraded health. Then let scanner + 
reconciler fix things as we go. I considered this approach and I'm actually not 
opposed to it, my hesitation was that it seems irresponsible to treat volumes 
that are frequently throwing errors the same as if they are totally healthy. We 
cannot choose to fail these reachable volumes without first copying all their 
data though.
   
   > What is considered an IO error which can trigger an on-demand scan? Is it a 
checksum validation or an unexpected EOF / data length error? Are we keeping a 
sliding window count of each unique block so that 10 failures on the same 
block only counts as 1 rather than 10?
   
   Everything listed here could trigger an on-demand scan. Currently the 
on-demand volume scanner is plugged into the `catch` blocks of most datanode IO 
paths. The sliding windows are planned to be tracked at a per-disk level, but 
this raises a good point that if one bad sector becomes hot it may artificially 
cause the volume to seem worse than it is purely based on scan counts.
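One way to address the hot-block concern raised here is to key the sliding window by block ID, so repeated failures on a single block count once. A hypothetical sketch, not existing Ozone code:

```python
import time

class DedupedIoErrorWindow:
    """Sliding window of IO errors keyed by block ID: ten failures on the
    same hot block contribute one distinct error, not ten."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self._last_seen = {}  # block_id -> most recent error timestamp

    def record_error(self, block_id: str, now: float = None) -> None:
        self._last_seen[block_id] = time.time() if now is None else now

    def distinct_error_count(self, now: float = None) -> int:
        now = time.time() if now is None else now
        cutoff = now - self.window
        # drop blocks whose last error has aged out of the window
        self._last_seen = {b: t for b, t in self._last_seen.items() if t >= cutoff}
        return len(self._last_seen)
```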
   
   Overall I agree that there is complexity involved here, and I am not tied to 
this particular solution. One alternate proposal could be to improve our disk 
health metrics and dashboards, maybe putting some info in Recon, to alert when 
disks have reached a degraded state. But at that point the safe way out would 
be disk decommissioning, which would be a new feature that looks similar to 
this one.
   
   Regardless of the proposal, I do think we need change in this area. As 
stated at the top of the doc, currently our only two options to handle partial 
volume failures are to reduce durability by removing all data on a disk that is 
potentially still readable, or swallow disk errors with the scanner and 
continue to put new data on this volume as if nothing is wrong.
   



Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-07 Thread via GitHub


sodonnel commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-2857905118

   My observation from past problems on HDFS is that partially failed disks are 
a very large problem. They are hard to detect and sometimes reads on them can 
block for a very long time, resulting in hard to explain slow reads. I'd be 
more in favor of failing bad volumes completely, rather than 
   
   
   I understand the idea this doc is going with, but it does add quite a bit of 
new complexity to the system:
   
* The new degraded state and DN excluding it for writes
* DN capacity reduction perhaps? Balancer has to factor this in.
* SCM tracking container to volume mappings
* The replication flow considering the new state
* Probably a need for SCM to tell clients to try the degraded volume last
   
   The system is intended to handle the abrupt loss of a datanode or disk at 
any time, so what is driving the need for this proposal? Are volumes being 
failed too easily resulting in dataloss?
   
   If volumes are being failed too eagerly, then for what reason? Disk full, 
checksum errors, outright failed reads?
   
   We do have mechanisms to repair bad containers already (scanner and 
reconcilor), so that part is handled.
   
   What is considered an IO error which can trigger an on-demand scan? Is it a 
checksum validation or an unexpected EOF / data length error? Are we keeping a 
sliding window count of each unique block so that 10 failures on the same 
block only counts as 1 rather than 10?





Re: [PR] HDDS-8387. Improved Storage Volume Handling in Datanodes [ozone]

2025-05-06 Thread via GitHub


errose28 commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-2856570153

   cc @ptlrs who helped work on this design.

