umamaheswararao commented on code in PR #343:
URL: https://github.com/apache/ozone-site/pull/343#discussion_r2814552456


##########
docs/07-system-internals/01-components/02-storage-container-manager/01-disk-layout.md:
##########
@@ -4,4 +4,143 @@ sidebar_label: Disk Layout
 
 # Storage Container Manager Disk Layout
 
-**TODO:** File a subtask under 
[HDDS-9862](https://issues.apache.org/jira/browse/HDDS-9862) and complete this 
page or section.
+## **Overview**
+
+The Storage Container Manager (SCM) is responsible for managing containers and pipelines in an Ozone cluster. To perform these tasks reliably, SCM persists its state in a set of local directories.
+
+## **Core Metadata Configurations**
+
+The following configuration keys define where the Storage Container Manager 
stores its persistent data. For production environments, it is recommended to 
host these directories on NVMe/SSDs to ensure high performance.
+
+- **`ozone.scm.db.dirs`**: Specifies the dedicated location for the Storage 
Container Manager RocksDB.
+- **`ozone.scm.ratis.storage.dir`**: Defines the storage location for Ratis 
(Raft) logs, which are essential for Storage Container Manager High 
Availability (HA).
+- **`ozone.metadata.dirs`**: Serves as the default location for 
security-related metadata (keys and certificates) and is often used as a 
fallback if specific DB directories are not defined.
+
+Ozone uses a hierarchical fallback system for configuration. For example, the SCM looks for its CA location in this order:
+
+- `hdds.scm.ca.location`: The most specific key; if it is set, it wins.
+- `hdds.scm.metadata.dirs`: SCM-wide metadata path.
+- `ozone.metadata.dirs`: Global fallback for all services.
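+
+To make the fallback concrete, the following minimal Java sketch mirrors the lookup order described above using Hadoop's generic `Configuration` API. It is an illustration of the chain (class and method names here are made up for the example), not Ozone's actual resolution code:
+
+```java
+import org.apache.hadoop.conf.Configuration;
+
+public class ScmCaLocationLookup {
+
+  /** Resolve the SCM CA location using the documented fallback chain. */
+  static String resolveCaLocation(Configuration conf) {
+    // 1. The most specific key wins if it is set.
+    String location = conf.getTrimmed("hdds.scm.ca.location");
+    if (location != null && !location.isEmpty()) {
+      return location;
+    }
+    // 2. Fall back to the SCM-wide metadata path.
+    location = conf.getTrimmed("hdds.scm.metadata.dirs");
+    if (location != null && !location.isEmpty()) {
+      return location;
+    }
+    // 3. Global fallback shared by all Ozone services.
+    return conf.getTrimmed("ozone.metadata.dirs");
+  }
+
+  public static void main(String[] args) {
+    Configuration conf = new Configuration();
+    conf.set("ozone.metadata.dirs", "/var/lib/hadoop-ozone/scm/ozone-metadata");
+    // Neither specific key is set, so the global fallback is returned.
+    System.out.println(resolveCaLocation(conf));
+  }
+}
+```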
+
+## **On-Disk Directory Structure**
+
+A typical SCM metadata directory structure looks like the following:
+
+```text
+/var/lib/hadoop-ozone/scm/
+├── data/                            # Primary Metadata (ozone.scm.db.dirs)
+│   ├── db.checkpoints/              # Point-in-time snapshots of SCM DB for external tools (e.g., Recon, HA)
+│   ├── scm/
+│   │   └── current/
+│   │       └── VERSION              # SCM identity and clusterID
+│   ├── scm.db/                      # Main RocksDB (Containers, Pipelines, etc.)
+│   │   ├── *.sst                    # Sorted String Table data files
+│   │   ├── CURRENT                  # Current manifest pointer
+│   │   ├── IDENTITY                 # DB Instance ID
+│   │   ├── MANIFEST-XXXXXX          # Database journal
+│   │   └── OPTIONS-XXXXXX           # RocksDB runtime configuration
+│   └── snapshot/                    # Ephemeral DB snapshots
+├── ozone-metadata/                  # Security Metadata (ozone.metadata.dirs)
+│   └── scm/
+│       ├── ca/                      # Root/Primary CA credentials
+│       │   ├── certs/
+│       │   │   └── certificate.crt  # Primary CA certificate
+│       │   └── keys/
+│       │       ├── private.pem      # CA Private key (keep secure)
+│       │       └── public.pem
+│       └── sub-ca/                  # Sub-CA credentials for SCM HA
+│           ├── certs/
+│           │   ├── <serial>.crt
+│           │   └── CA-1.crt         # Linked Primary CA cert
+│           └── keys/
+│               ├── private.pem
+│               └── public.pem
+└── scm-ha/                          # Raft Logs (ozone.scm.ratis.storage.dir)
+    └── <ratis-group-uuid>/          # Unique Ratis Ring ID
+        └── current/
+            ├── log_0-0              # Closed Raft log segments
+            ├── log_inprogress_271   # Active Raft log segment
+            ├── raft-meta            # Raft persistence state
+            └── raft-meta.conf       # Quorum membership info
+```
+
+## **Detailed Component Breakdown**
+
+### **1. RocksDB (`scm.db`)**
+
+SCM uses an embedded RocksDB to store all mapping and state information. This 
database consists of several column families (tables):
+
+- **Pipelines:** Maps pipeline IDs to the list of Datanodes forming the 
pipeline.  
+- **Containers:** Maps container IDs to their state (Open/Closed), replication 
type, and owner.  
+- **Deleted Blocks:** A queue of blocks that have been marked for deletion and 
need to be cleaned up from Datanodes.  
+- **Valid Certificates:** Stores certificates issued by the SCM Certificate 
Authority (CA).  
+- **Datanodes:** Tracks the registration and heartbeat status of all Datanodes 
in the cluster.
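+
+If you need to verify which column families exist in a given `scm.db`, the standard RocksDB Java API can list them offline. Below is a minimal sketch run against a copy of the DB; note that the human-readable names above are descriptive and do not necessarily match the exact column family names SCM uses internally:
+
+```java
+import java.util.List;
+
+import org.rocksdb.Options;
+import org.rocksdb.RocksDB;
+import org.rocksdb.RocksDBException;
+
+public class ScmDbInspector {
+  public static void main(String[] args) throws RocksDBException {
+    RocksDB.loadLibrary();
+    // Point this at a *copy* of scm.db; never open the DB of a running SCM.
+    String dbPath = "/tmp/scm.db-copy";
+    try (Options options = new Options()) {
+      // listColumnFamilies reads the MANIFEST without opening the DB for writes.
+      List<byte[]> families = RocksDB.listColumnFamilies(options, dbPath);
+      for (byte[] name : families) {
+        System.out.println(new String(name));
+      }
+    }
+  }
+}
+```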
+
+### **2. The VERSION File**
+
+The VERSION file is created during `scm --init` and serves as the SCM's "identity card", enforcing cluster-wide consistency during the handshake process. When the SCM starts, it reads its `clusterID` and `scmUuid` from this file to verify that it is authorized to manage the metadata in its local directory. When Datanodes attempt to register, the SCM cross-references its own `clusterID` and `cTime` with the information provided by the Datanodes, preventing nodes from different clusters from accidentally joining and causing data corruption. The `layoutVersion` field is also critical for software upgrades: it tells the SCM which on-disk metadata features are currently active, ensuring that the service does not attempt to process data formats it does not recognize or support.
+
+Key fields include:
+
+- **nodeType**: Always `SCM` for this component.  
+- **clusterID**: The unique identifier for the entire Ozone cluster.  
+- **scmUuid**: The unique identifier for this SCM node.  
+- **layoutVersion**: The software-specific data layout version.
+- **cTime**: Records when the component was formatted or last upgraded. Together with `layoutVersion`, this prevents older software from accidentally managing data created by a newer version (Layout Versioning).
+
+A sample SCM VERSION file looks like this:
+
+```text
+#Fri Feb 13 10:58:02 PDT 2020
+nodeType=SCM
+scmUuid=fa1376f0-23ad-4cda-93d6-0c9fd79c7ae3
+clusterID=CID-2a3f36e8-506f-40eb-986e-bb79d188fd55
+cTime=1771013425196
+layoutVersion=0
+```
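+
+Since the sample above is in `java.util.Properties` format (note the timestamp comment on the first line), it can be read with the standard library alone. A small sketch, with the path taken from the example directory tree earlier on this page:
+
+```java
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+public class VersionFileReader {
+  public static void main(String[] args) throws IOException {
+    // Example path from the directory tree above; adjust for your deployment.
+    String path = "/var/lib/hadoop-ozone/scm/data/scm/current/VERSION";
+    Properties props = new Properties();
+    try (FileReader reader = new FileReader(path)) {
+      props.load(reader);
+    }
+    System.out.println("nodeType      = " + props.getProperty("nodeType"));
+    System.out.println("clusterID     = " + props.getProperty("clusterID"));
+    System.out.println("scmUuid       = " + props.getProperty("scmUuid"));
+    System.out.println("layoutVersion = " + props.getProperty("layoutVersion"));
+  }
+}
+```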
+
+### **3. Certificate Authority (CA)**
+
+The `ozone-metadata/scm/ca` directory contains `keys` and `certs` sub-directories, which persist the SCM's certificate and public/private key pair. The SCM private key is used to sign the issued certificates and tokens. In the context of SCM HA, an SCM can act as the Root CA, which issues certificates to the SCM instances, or as a Sub CA (a.k.a. Sub SCM), which issues certificates to Ozone Managers and Datanodes.
+
+Certificate issuance is handled by the SCM Ratis leader, and the certificates persisted into RocksDB are replicated consistently to the SCM follower instances.
+
+Among the SCM instances, one is designated as the Primary SCM, which acts as the Root CA during `scm --init`. All other SCM instances run bootstrap to obtain an SCM instance certificate issued by the Primary SCM.
+
+The layout above uses the all-in-one `ozone.metadata.dirs` setting, without placing the metadata DB on separate drives. As a result, the Primary SCM stores its CA metadata under `<The path of ozone.metadata.dirs>/scm/ca`.
+
+The security metadata of all Sub SCM instances (including the one running on the Primary SCM) is stored under `<The path of ozone.metadata.dirs>/scm/sub-ca`, with `keys` and `certs` sub-directories under it. The directory tree above shows an example of this Sub SCM security metadata layout.
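+
+One way to confirm the Root CA / Sub CA relationship on disk is to inspect the certificates with the JDK's `CertificateFactory`: for a Sub CA certificate, the issuer should be the Primary SCM's CA. A minimal sketch, with the path taken from the example tree above:
+
+```java
+import java.io.FileInputStream;
+import java.security.cert.CertificateFactory;
+import java.security.cert.X509Certificate;
+
+public class CaCertInspector {
+  public static void main(String[] args) throws Exception {
+    // Example path from the directory tree above; adjust for your deployment.
+    String path = "/var/lib/hadoop-ozone/scm/ozone-metadata/scm/ca/certs/certificate.crt";
+    CertificateFactory factory = CertificateFactory.getInstance("X.509");
+    try (FileInputStream in = new FileInputStream(path)) {
+      X509Certificate cert = (X509Certificate) factory.generateCertificate(in);
+      System.out.println("Subject: " + cert.getSubjectX500Principal());
+      System.out.println("Issuer:  " + cert.getIssuerX500Principal());
+      System.out.println("Expires: " + cert.getNotAfter());
+    }
+  }
+}
+```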
+
+### **4. Ratis Logs**
+
+When SCM High Availability (HA) is enabled, SCM uses **Apache Ratis** to 
replicate its state across the SCM quorum.
+
+- The Ratis storage directory (`scm-ha/` in the layout above) contains the Raft log segments.  
+- Every write request (e.g., container allocation) is first appended to this log and replicated to followers before being applied to the local `scm.db`.
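+
+The segment naming shown in the directory tree above (`log_<start>-<end>` for closed segments, `log_inprogress_<start>` for the active one) makes the log state visible at a glance. A small illustrative sketch that lists the segments of one Ratis group:
+
+```java
+import java.io.IOException;
+import java.nio.file.DirectoryStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+public class RaftLogLister {
+  public static void main(String[] args) throws IOException {
+    // Substitute the actual <ratis-group-uuid> directory from your deployment.
+    Path current = Paths.get("/var/lib/hadoop-ozone/scm/scm-ha/<ratis-group-uuid>/current");
+    try (DirectoryStream<Path> stream = Files.newDirectoryStream(current, "log_*")) {
+      for (Path segment : stream) {
+        String name = segment.getFileName().toString();
+        // Closed segments are log_<start>-<end>; the active one is log_inprogress_<start>.
+        String state = name.startsWith("log_inprogress") ? "ACTIVE" : "CLOSED";
+        System.out.printf("%-6s %s (%d bytes)%n", state, name, Files.size(segment));
+      }
+    }
+  }
+}
+```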
+
+### **5. `db.checkpoints`**
+
+The `db.checkpoints` directory serves as a dedicated storage area for point-in-time snapshots of the active SCM RocksDB. These snapshots are primarily used for Recon integration and High Availability (HA) synchronization: when the Recon service or a lagging SCM follower needs to catch up to the current cluster state, SCM creates a consistent checkpoint here using filesystem hard links. This allows the system to export the database state without pausing write operations or duplicating large data files, ensuring that metadata can be transferred across the network while the main service remains online and performant.
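+
+The hard-link mechanism described above is the standard RocksDB checkpoint facility. The following self-contained sketch demonstrates it with the RocksDB Java API on a throwaway DB; it illustrates the mechanism rather than SCM's own checkpoint code:
+
+```java
+import org.rocksdb.Checkpoint;
+import org.rocksdb.Options;
+import org.rocksdb.RocksDB;
+import org.rocksdb.RocksDBException;
+
+public class CheckpointDemo {
+  public static void main(String[] args) throws RocksDBException {
+    RocksDB.loadLibrary();
+    try (Options options = new Options().setCreateIfMissing(true);
+         RocksDB db = RocksDB.open(options, "/tmp/demo.db")) {
+      db.put("container-1".getBytes(), "OPEN".getBytes());
+      // createCheckpoint hard-links the immutable SST files into the target
+      // directory, so the snapshot is cheap and the DB keeps serving writes.
+      try (Checkpoint checkpoint = Checkpoint.create(db)) {
+        checkpoint.createCheckpoint("/tmp/demo.db-checkpoint-1");
+      }
+    }
+  }
+}
+```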
+
+### **Recommended Storage Configuration Mapping**
+
+The following properties in `ozone-site.xml` control the disk layout:
+
+| Path Description | Configuration Key | Storage Type Recommendation | Purpose |
+| :---- | :---- | :---- | :---- |
+| **SCM Metadata Database** | `ozone.scm.db.dirs` | **NVMe (strongly recommended)** | Primary directory for the SCM RocksDB and version files. |
+| **SCM Ratis Logs** | `ozone.scm.ratis.storage.dir` | **NVMe or very fast SSD** | Holds the Raft write-ahead log (WAL) for SCM consensus. Every metadata mutation must be fsynced before commit; slow disks increase write latency across the entire cluster because clients wait for quorum commit. |
+| **General Ozone Metadata / Security Material** | `ozone.metadata.dirs` | **SSD preferred (HDD acceptable for small clusters)** | If `hdds.scm.ca.location` and `hdds.scm.metadata.dirs` are not configured, SCM falls back to this location to store its CA certificates and other security material. |
+
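+To tie the table together, here is one way a split production layout could be expressed. The mount points are hypothetical, and in a real deployment these would be `<property>` entries in `ozone-site.xml` rather than programmatic calls; the sketch assumes the Ozone common jar providing `OzoneConfiguration` is on the classpath:
+
+```java
+import org.apache.hadoop.hdds.conf.OzoneConfiguration;
+
+public class ScmDiskLayoutConfig {
+  public static void main(String[] args) {
+    OzoneConfiguration conf = new OzoneConfiguration();
+    // Hypothetical mount points: one fast device per hot path.
+    conf.set("ozone.scm.db.dirs", "/mnt/nvme1/scm/db");              // SCM RocksDB
+    conf.set("ozone.scm.ratis.storage.dir", "/mnt/nvme2/scm/ratis"); // Raft WAL
+    conf.set("ozone.metadata.dirs", "/mnt/ssd1/ozone/metadata");     // security material
+    System.out.println("SCM DB dirs: " + conf.get("ozone.scm.db.dirs"));
+  }
+}
+```
+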
+## **Layout Implementation for Different Environments**
+
+### **Development/Test Environments**
+
+For simplicity, a single "All-in-One" location can be used by setting 
`ozone.metadata.dirs`. All services (OM, SCM, DN) will store their metadata in 
sub-folders under this single path.
+
+### **Production Environments**
+
+It is strongly recommended to separate these directories. The `scm.db` and Ratis log directories should reside on high-IOPS storage (SSDs/NVMe) to minimize latency for container and block operations, while security certificates can remain on standard persistent storage.
+
+- **Storage Type:** It is highly recommended to host the `scm.db` and Ratis log directories on **NVMe or SAS SSDs** to minimize latency for block and container allocations.  
+- **Redundancy:** Use **RAID 1** (mirroring) or **RAID 1+0** for metadata disks to protect against local disk failure, even if SCM HA is enabled at the software layer.

Review Comment:
   Thanks @yandrey321 for taking a look. Updated the doc with the RAID 1+0 option as well.
   On #2, I agree. If the drives fail and the current SCM can't be recovered, then we have the SCM node decommission option. I think we can update the troubleshooting guide. Please help file a JIRA against the troubleshooting guide if this is not covered in the SCM decommission docs.


