[ 
https://issues.apache.org/jira/browse/HDDS-12151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-12151:
------------------------------
    Description: 
We observed cases where completely full Ozone volumes (with the reserved space 
gone as well; StorageVolumeChecker throwing "java.io.IOException: No space left 
on device") could lead to complications such as container state divergence, 
where a container ends up having replicas with different contents (blocks).

Even with {{hdds.datanode.dir.du.reserved}} or 
{{hdds.datanode.dir.du.reserved.percent}} configured, the reserved space does 
not seem to be fully respected by the datanode itself, because we have seen 
some volumes with 0 bytes left. Moreover, we cannot control what other 
applications might do to the same volume mount.
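
If the datanode were to enforce the reserve on its own write paths, the guard 
could look something like the following minimal sketch. This is a hypothetical 
helper for illustration only, not actual Ozone code; the class and field names 
are assumptions:

{code:java}
import java.io.File;
import java.io.IOException;

/**
 * Hypothetical sketch (not actual Ozone code): a pre-write guard each
 * datanode write path could apply, assuming the configured reserved bytes
 * (e.g. from hdds.datanode.dir.du.reserved) are available to the caller.
 */
public class VolumeSpaceGuard {
  private final File volumeRoot;
  private final long reservedBytes;

  public VolumeSpaceGuard(File volumeRoot, long reservedBytes) {
    this.volumeRoot = volumeRoot;
    this.reservedBytes = reservedBytes;
  }

  /** Fails before the write instead of letting it die mid-way with ENOSPC. */
  public void ensureSpace(long bytesToWrite) throws IOException {
    long usable = volumeRoot.getUsableSpace();
    if (usable - bytesToWrite < reservedBytes) {
      throw new IOException("Volume " + volumeRoot + " would dip below the "
          + reservedBytes + "-byte reserve; rejecting " + bytesToWrite
          + "-byte write (usable=" + usable + ")");
    }
  }
}
{code}

The point of checking up front is that a rejected write is a clean, retriable 
error, whereas an ENOSPC in the middle of a chunk write can leave a partial 
replica behind.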

----

List of Ozone datanode write operations (including Ratis ones) that need to be 
checked, off the top of my head:

1. Ratis log append -- located under 
{{dfs.container.ratis.datanode.storage.dir}}. When the mount is full, the 
Ratis server shuts down (and the datanode might shut down with it)
2. WriteChunk
3. Container metadata RocksDB updates
4. StorageVolumeChecker canary -- interval controlled by 
hdds.datanode.periodic.disk.check.interval.minutes
{code:title=Example of IOException: No space left on device thrown from 
disk/volume checker}
2025-01-21 14:45:07,165 ERROR [DataNode DiskChecker thread 
15]-org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume 
/mnt/volume12/hadoop-ozone/datanode/data/hdds failed health check. Could not 
write file 
/mnt/volume12/hadoop-ozone/datanode/data/hdds/CID-3d4afb70-9e42-4b56-8867-25fcd933f205/tmp/disk-check/disk-check-2a25d962-40cb-4ec8-8f0f-708853dccb7e
 for volume check.
java.io.IOException: No space left on device 
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:313)
        at 
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:138)
        at 
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:66)
        at 
org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:641)
        at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.check(HddsVolume.java:259)
        at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.check(HddsVolume.java:68)
        at 
org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
5. datanode log append (typically on a different volume, though)
6. During schema v3 container replication, the metadata DB is first dumped to 
an external file on the source node -- see 
[KeyValueContainer#packContainerToDestination|https://github.com/apache/ozone/blob/189a9fe42013e82f93becaede627e509322c38f6/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java#L1012].
 Note that the blocks appear to be streamed over to the destination directly.
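
For items 2 and 6 in particular, one way to keep a "no space left" failure 
from ever producing a divergent replica is to stage every on-disk artifact in 
a temp file and only atomically rename it into place after it is fully written 
and synced. A minimal sketch, with assumed names and not actual Ozone code:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical sketch (not actual Ozone code) of an ENOSPC-safe chunk write:
 * stage into a temp file, fsync, then atomically rename, so a "No space left
 * on device" failure never leaves a half-written chunk visible to readers or
 * to replication.
 */
public final class SafeChunkWriter {
  public static void writeChunk(Path chunkFile, byte[] data) throws IOException {
    Path tmp = chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
    try (FileChannel ch = FileChannel.open(tmp,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)) {
      ByteBuffer buf = ByteBuffer.wrap(data);
      while (buf.hasRemaining()) {
        ch.write(buf); // ENOSPC surfaces here as an IOException
      }
      ch.force(true); // flush data to disk before the rename
    } catch (IOException e) {
      Files.deleteIfExists(tmp); // never leave a partial chunk behind
      throw e;
    }
    Files.move(tmp, chunkFile, StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}

With this pattern a failed write leaves either the old state or nothing, never 
a truncated file, which is exactly the property the divergence cases lacked.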

----

We have to make sure each of those operations handles the "disk full" / "no 
space left" situation gracefully. The goal is to ensure that issues like 
container replica divergence do not happen again.
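
To exercise each of the write paths above without filling a real disk, ENOSPC 
could be injected deterministically, e.g. with a wrapper stream that starts 
failing after a byte budget. Hypothetical test utility, not existing Ozone 
code:

{code:java}
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical test utility (not Ozone code): wraps an OutputStream and
 * throws "No space left on device" once a byte budget is spent, so each
 * datanode write path can be tested against ENOSPC deterministically
 * instead of filling a real volume.
 */
public class DiskFullInjectingStream extends OutputStream {
  private final OutputStream delegate;
  private long remaining; // bytes allowed before the simulated disk fills

  public DiskFullInjectingStream(OutputStream delegate, long budgetBytes) {
    this.delegate = delegate;
    this.remaining = budgetBytes;
  }

  @Override
  public void write(int b) throws IOException {
    if (remaining < 1) {
      throw new IOException("No space left on device (simulated)");
    }
    remaining--;
    delegate.write(b);
  }
}
{code}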

[~erose]


> Check every single write operation on an Ozone datanode is handled correctly 
> in the case of full volumes
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-12151
>                 URL: https://issues.apache.org/jira/browse/HDDS-12151
>             Project: Apache Ozone
>          Issue Type: Task
>            Reporter: Siyao Meng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
