Wei-Chiu Chuang created HDDS-15594:
--------------------------------------
Summary: Rack Aware and Rack Scatter Policies to consider rack
density/capacity
Key: HDDS-15594
URL: https://issues.apache.org/jira/browse/HDDS-15594
Project: Apache Ozone
Issue Type: Task
Reporter: Wei-Chiu Chuang
h1. Problem Statement
The current Apache Ozone container placement policies
(`SCMContainerPlacementRackAware` and `SCMContainerPlacementRackScatter`)
select racks and nodes under those racks without considering the aggregate
storage capacity or density of the racks.
*(Note: Node-level capacity awareness is addressed separately; this proposal
focuses strictly on rack-level capacity imbalances).*
In a heterogeneous deployment (e.g., where Rack A has an aggregate capacity of
1PB across its datanodes, and Rack B has an aggregate capacity of 10PB):
1. **Uniform Rack Selection:** The placement policies select racks uniformly at
random or in a round-robin/scatter manner to satisfy rack-level fault tolerance
(e.g., 2 racks for 3-replica Ratis pipelines, or maximizing unique racks for
Erasure Coding).
2. **Aggressive Depletion of Low-Capacity Racks:** Because the placement policy
treats all racks as having equal capacity, the 1PB rack receives a similar
number of container allocations as the 10PB rack. Consequently, the 1PB rack
fills up roughly **10 times faster** relative to its capacity.
3. **Loss of Rack-Level Fault Tolerance:** Once the 1PB rack is full, SCM can
no longer allocate new containers that span that rack. This forces SCM to
either fail new container allocations or fall back to placing multiple replicas
on the same rack, violating the rack-safety policy.
4. **Sub-optimal Rebalancing:** Even if individual datanode capacity policies
(or the `ContainerBalancer`) optimize node-level usage, the placement path
lacks the global rack-level awareness needed to prevent entire low-capacity
racks from filling up prematurely.
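The depletion arithmetic in points 2 and 3 can be sketched with a simplified model. It assumes every rack receives either an equal share of the written data (uniform selection) or a share proportional to its capacity (weighted selection), and it ignores the distinct-rack constraint. `RackFillModel` and `fillFractions` are hypothetical names used only for illustration; they are not Ozone classes.

```java
final class RackFillModel {
  private RackFillModel() {}

  /**
   * Returns the fill fraction (0.0 to 1.0) of each rack after writtenPb
   * petabytes have been placed.
   *
   * weighted == false: every rack receives the same amount of data.
   * weighted == true:  every rack receives data in proportion to its capacity.
   */
  static double[] fillFractions(double[] capacityPb, double writtenPb,
      boolean weighted) {
    double totalCapacity = 0.0;
    for (double c : capacityPb) {
      totalCapacity += c;
    }
    double[] fractions = new double[capacityPb.length];
    for (int i = 0; i < capacityPb.length; i++) {
      double share = weighted
          ? writtenPb * capacityPb[i] / totalCapacity
          : writtenPb / capacityPb.length;
      // A rack cannot hold more than its capacity, so cap the fraction at 1.0.
      fractions[i] = Math.min(1.0, share / capacityPb[i]);
    }
    return fractions;
  }
}
```

With capacities of 1PB and 10PB and 2PB written, the uniform case leaves the 1PB rack full while the 10PB rack is only 10% used. The weighted case leaves both racks at the same fraction, 2/11 of capacity.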
h1. Proposed Improvement
We should introduce rack capacity and density awareness into SCM placement
policies to weight the selection of racks themselves:
h2. Option 1: Rack-Weighted Selection
* Modify the rack selection step in `SCMContainerPlacementRackAware` and
`SCMContainerPlacementRackScatter`.
* The probability of selecting a particular rack should be weighted by the
aggregate capacity (or aggregate remaining capacity) of all healthy, active
datanodes in that rack.
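A minimal sketch of the weighted selection described in Option 1, assuming the per-rack weights (aggregate or remaining capacity of healthy datanodes, in any consistent unit) have already been computed elsewhere. `RackWeightedPicker`, `pickOne`, and `pickDistinct` are hypothetical names for illustration and are not part of the existing SCM placement classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class RackWeightedPicker {
  private RackWeightedPicker() {}

  /** Picks one rack with probability proportional to its positive weight. */
  static String pickOne(Map<String, Long> weights, Random rng) {
    long total = 0L;
    for (long w : weights.values()) {
      total += Math.max(0L, w);
    }
    if (total == 0L) {
      throw new IllegalArgumentException("no rack with positive weight");
    }
    // Choose a point in [0, total) and walk the cumulative weights.
    long target = (long) (rng.nextDouble() * total);
    String last = null;
    for (Map.Entry<String, Long> e : weights.entrySet()) {
      long w = Math.max(0L, e.getValue());
      if (w == 0L) {
        continue; // racks without usable capacity are never selected
      }
      last = e.getKey();
      if (target < w) {
        return last;
      }
      target -= w;
    }
    return last; // only reached if floating-point rounding overshoots
  }

  /** Picks {@code count} different racks, each draw weighted by capacity. */
  static List<String> pickDistinct(Map<String, Long> weights, int count,
      Random rng) {
    Map<String, Long> remaining = new LinkedHashMap<>(weights);
    remaining.values().removeIf(w -> w <= 0L);
    if (count > remaining.size()) {
      throw new IllegalArgumentException("not enough racks with capacity");
    }
    List<String> chosen = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      String rack = pickOne(remaining, rng);
      chosen.add(rack);
      remaining.remove(rack); // sample without replacement
    }
    return chosen;
  }
}
```

Drawing without replacement keeps the distinct-rack requirement intact while still biasing each draw toward larger racks. Whether the weight should be total or remaining capacity is left open here, as in the proposal.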
h2. Option 2: Rack-Capacity-Aware Placement Constraints
* When selecting target racks for container replicas (especially under Erasure
Coding where spanning multiple racks is critical), the algorithm should
optimize the distribution such that higher-density racks host a proportionally
larger share of the replica load without violating fault tolerance bounds.
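One possible reading of Option 2 is a per-container replica quota: replicas are handed out one at a time to the rack with the highest capacity per replica already assigned, subject to a cap on replicas per rack. The sketch below assumes that interpretation. `RackQuotaPlanner`, `allocate`, and the `maxPerRack` parameter are hypothetical and stand in for whatever fault tolerance bound the policy enforces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RackQuotaPlanner {
  private RackQuotaPlanner() {}

  /**
   * Distributes {@code replicas} across racks in proportion to capacity,
   * never placing more than {@code maxPerRack} replicas on one rack.
   */
  static Map<String, Integer> allocate(Map<String, Long> capacity,
      int replicas, int maxPerRack) {
    Map<String, Integer> assigned = new LinkedHashMap<>();
    for (String rack : capacity.keySet()) {
      assigned.put(rack, 0);
    }
    for (int i = 0; i < replicas; i++) {
      String best = null;
      double bestQuotient = -1.0;
      for (Map.Entry<String, Long> e : capacity.entrySet()) {
        int have = assigned.get(e.getKey());
        if (have >= maxPerRack || e.getValue() <= 0L) {
          continue; // rack is at its bound or has no capacity
        }
        // Highest capacity per (assigned + 1) wins; ties go to the first rack.
        double quotient = (double) e.getValue() / (have + 1);
        if (quotient > bestQuotient) {
          bestQuotient = quotient;
          best = e.getKey();
        }
      }
      if (best == null) {
        throw new IllegalStateException("not enough rack slots for replicas");
      }
      assigned.merge(best, 1, Integer::sum);
    }
    return assigned;
  }
}
```

For capacities `{A: 1, B: 10, C: 10, D: 10}`, five replicas, and at most two per rack, this yields `{A: 0, B: 2, C: 2, D: 1}`. A real policy would likely feed in remaining capacity rather than total capacity, so that smaller racks still receive replicas as the larger ones fill.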
h1. Benefits
* Prevents low-capacity racks from filling up prematurely in heterogeneous rack
environments.
* Proactively preserves rack-level fault tolerance for the entire cluster by
spreading the write load proportionally to each rack's storage footprint.
* Reduces I/O and network overhead generated by reactive post-write balancing
(`ContainerBalancer`).