Siyao Meng created HDDS-12151:
---------------------------------

             Summary: Check every single write operation on an Ozone datanode 
is handled correctly in the case of full volumes
                 Key: HDDS-12151
                 URL: https://issues.apache.org/jira/browse/HDDS-12151
             Project: Apache Ozone
          Issue Type: Task
            Reporter: Siyao Meng


We observed cases where full Ozone volumes (reserved space also exhausted, 
StorageVolumeChecker throwing) could lead to complications such as container 
state divergence, where a container ends up with replicas that have different 
contents (blocks).

Even with {{hdds.datanode.dir.du.reserved}} or 
{{hdds.datanode.dir.du.reserved.percent}} set, the reservation doesn't seem to 
be fully respected by the datanode itself, because we have seen some volumes 
with 0 bytes left. In addition, we can't control what other applications might 
do to the volume mount.
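
To illustrate the kind of check a reservation implies, here is a minimal sketch in Java. It is not the datanode's actual implementation: the class {{ReservedSpaceCheck}}, its methods, and the 1 GiB / 4 MiB figures are hypothetical. It only shows that a write should be admitted against the space the OS currently reports minus the reservation, since other applications on the same mount can shrink that number at any time.

```java
import java.io.File;

public class ReservedSpaceCheck {

    /** Hypothetical helper: bytes left for writes once the reservation is subtracted. */
    static long availableAfterReserve(long usableBytes, long reservedBytes) {
        // Clamp at zero: the mount may already be below the reservation.
        return Math.max(0L, usableBytes - reservedBytes);
    }

    /** Hypothetical helper: true if a write of requestBytes fits without eating into the reserve. */
    static boolean canWrite(long usableBytes, long reservedBytes, long requestBytes) {
        return requestBytes >= 0
            && requestBytes <= availableAfterReserve(usableBytes, reservedBytes);
    }

    public static void main(String[] args) {
        // Assumption: the volume root is passed as the first argument.
        File volumeRoot = new File(args.length > 0 ? args[0] : ".");
        // getUsableSpace() reflects what the OS reports now, including
        // space consumed by other applications on the same mount.
        long usable = volumeRoot.getUsableSpace();
        long reserved = 1L << 30;   // illustrative 1 GiB reservation
        long request = 4L << 20;    // illustrative 4 MiB write
        System.out.println("usable=" + usable
            + " canWrite=" + canWrite(usable, reserved, request));
    }
}
```

Such a check is inherently racy (space can disappear between the check and the write), so it can only reduce, not eliminate, the need to handle a failed write.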

----

List of Ozone datanode write operations (including Ratis ones) that need to be 
checked, off the top of my head:

1. Ratis log append -- located under 
{{dfs.container.ratis.datanode.storage.dir}}. When the mount is full, the 
Ratis server would shut down (the datanode might also shut down with it)
2. WriteChunk
3. Container metadata RocksDB updates
4. StorageVolumeChecker canary -- interval controlled by 
{{hdds.datanode.periodic.disk.check.interval.minutes}}
5. Datanode log append (typically on a different volume, though)
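
As a generic illustration of what "handled correctly" could mean for a file write such as item 2, here is a minimal Java sketch. It is not how Ozone implements WriteChunk: the class {{AtomicChunkWrite}} and its method are hypothetical. The idea is to write to a temporary sibling file and rename it into place, so that a failure partway through (for example "no space left") leaves no partial file at the target path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicChunkWrite {

    /**
     * Hypothetical helper: writes data to a temporary sibling file and then
     * renames it onto target. Returns false on any I/O failure, in which
     * case this call has not created or modified the target path.
     */
    static boolean writeAtomically(Path target, byte[] data) {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            // A full volume would surface here as an IOException.
            Files.write(tmp, data);
            // Rename within the same directory; readers see all or nothing.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (IOException e) {
            try {
                // Best-effort cleanup of the partial temporary file.
                Files.deleteIfExists(tmp);
            } catch (IOException ignored) {
                // Nothing more to do; the target itself was never touched.
            }
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunk-demo");
        boolean ok = writeAtomically(dir.resolve("chunk.bin"), new byte[] {1, 2, 3});
        System.out.println("write succeeded: " + ok);
    }
}
```

The caller still has to decide what a {{false}} result means for the container (fail the request, mark the volume unhealthy, and so on); the sketch only covers not leaving a half-written file behind.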

----

We have to make sure each of those operations handles the "disk full" / "no 
space left" situation gracefully. The goal is to make sure issues like 
container replica divergence don't happen again.

[~erose]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
