[ 
https://issues.apache.org/jira/browse/HDDS-12151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-12151:
------------------------------
    Description: 
We observed cases where completely full Ozone volumes (with the reserved space 
also exhausted; StorageVolumeChecker throws "java.io.IOException: No space left 
on device") can lead to complications such as container state divergence, where 
a container ends up with replicas that have different contents (blocks).

Even with {{hdds.datanode.dir.du.reserved}} or 
{{hdds.datanode.dir.du.reserved.percent}} set, the reservation does not appear 
to be fully respected by the datanode itself, because we have seen some volumes 
with 0 bytes left. Moreover, we cannot control what other applications might do 
to the volume mount.
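For illustration, a minimal sketch of the kind of reservation enforcement 
described above -- a write is admitted only if it would still leave at least 
the reserved bytes free afterwards. The class and method names here are 
invented for the example; this is not Ozone's actual accounting code.

```java
// Hypothetical sketch of reserved-space enforcement; not Ozone's actual code.
public class ReservedSpaceCheck {

    /**
     * Admit a write of writeSize bytes only if it would still leave at least
     * reservedBytes free on the volume afterwards.
     */
    static boolean hasSpaceFor(long freeBytes, long reservedBytes, long writeSize) {
        return freeBytes - writeSize >= reservedBytes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // 10 GiB free, 5 GiB reserved: a 4 GiB write fits, a 6 GiB write does not.
        System.out.println(hasSpaceFor(10 * gib, 5 * gib, 4 * gib)); // true
        System.out.println(hasSpaceFor(10 * gib, 5 * gib, 6 * gib)); // false
    }
}
```

The point of the sketch is that the reservation must be checked on the write 
path itself, not only by a periodic disk checker, for it to hold.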

----

List of Ozone datanode write operations (including Ratis ones) that need to be 
checked, off the top of my head:

1. Ratis log append -- located under 
{{dfs.container.ratis.datanode.storage.dir}}. When the mount is full, the 
Ratis server shuts down (and the datanode might shut down with it).
2. WriteChunk
3. Container metadata file update
4. Container RocksDB update
5. StorageVolumeChecker canary -- interval controlled by 
{{hdds.datanode.periodic.disk.check.interval.minutes}}
{code:title=Example of IOException: No space left on device thrown from 
disk/volume checker}
2025-01-21 14:45:07,165 ERROR [DataNode DiskChecker thread 
15]-org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume 
/mnt/volume12/hadoop-ozone/datanode/data/hdds failed health check. Could not 
write file 
/mnt/volume12/hadoop-ozone/datanode/data/hdds/CID-3d4afb70-9e42-4b56-8867-25fcd933f205/tmp/disk-check/disk-check-2a25d962-40cb-4ec8-8f0f-708853dccb7e
 for volume check.
java.io.IOException: No space left on device 
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:313)
        at 
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:138)
        at 
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:66)
        at 
org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:641)
        at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.check(HddsVolume.java:259)
        at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.check(HddsVolume.java:68)
        at 
org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
6. Datanode log append (typically on a different volume, though).
7. During schema v3 container replication, the metadata DB is first dumped to 
an external file on the source node -- see 
[KeyValueContainer#packContainerToDestination|https://github.com/apache/ozone/blob/189a9fe42013e82f93becaede627e509322c38f6/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java#L1012].
 Note that the blocks appear to be streamed to the destination directly.
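For item 7, a hypothetical pre-flight check (not something Ozone does today, 
names invented for the example) could verify that the filesystem backing the 
temp directory has enough usable space for the estimated dump before writing 
it, instead of discovering ENOSPC mid-dump:

```java
// Hypothetical pre-flight space check before dumping the container metadata
// DB to a temp file during replication; not an existing Ozone API.
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpSpaceCheck {

    /** True if the filesystem backing tmpDir can hold estimatedDumpBytes. */
    static boolean canDump(Path tmpDir, long estimatedDumpBytes) throws IOException {
        FileStore store = Files.getFileStore(tmpDir);
        return store.getUsableSpace() >= estimatedDumpBytes;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("dump-check");
        System.out.println(canDump(tmp, 1));              // a 1-byte dump should fit
        System.out.println(canDump(tmp, Long.MAX_VALUE)); // ~8 EiB should not
        Files.delete(tmp);
    }
}
```

Such a check is only advisory (space can still vanish between check and write), 
so the write path would still need to handle the IOException itself.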

----

We have to make sure each of these operations handles the "disk full" / "no 
space left" situation gracefully. The goal is to ensure that issues like 
container replica divergence cannot happen again.
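One possible shape of that graceful handling, as a sketch (the method names and 
the message-based ENOSPC check below are invented for illustration, not Ozone 
APIs): detect the disk-full IOException and convert it into a clean, 
whole-operation failure so no partially written replica survives.

```java
// Hypothetical sketch of graceful disk-full handling; names are invented.
import java.io.IOException;

public class DiskFullHandling {

    /** Heuristic: does this IOException indicate ENOSPC? */
    static boolean isDiskFull(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("No space left on device");
    }

    interface WriteAction { void run() throws IOException; }

    /**
     * Run a write; if the disk is full, rethrow a distinct error so the caller
     * can mark the volume unhealthy and discard the partial write, instead of
     * leaving a replica with divergent contents.
     */
    static void writeOrFailVolume(WriteAction action) throws IOException {
        try {
            action.run();
        } catch (IOException e) {
            if (isDiskFull(e)) {
                throw new IOException("volume out of space; failing write cleanly", e);
            }
            throw e; // unrelated I/O error: propagate as-is
        }
    }
}
```

The key property is that every write path in the list above either completes 
fully or fails fully, never leaving a partial replica behind.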

[~erose]



> Check every single write operation on an Ozone datanode is handled correctly 
> in the case of full volumes
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-12151
>                 URL: https://issues.apache.org/jira/browse/HDDS-12151
>             Project: Apache Ozone
>          Issue Type: Task
>            Reporter: Siyao Meng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
