vtutrinov commented on code in PR #10103:
URL: https://github.com/apache/ozone/pull/10103#discussion_r3458233819


##########
hadoop-hdds/docs/content/design/om-multiraft.md:
##########
@@ -0,0 +1,517 @@
+# Ozone Multi-Raft Design Document
+
+## Abstract
+
+This document proposes a multi-raft architecture for Apache Ozone's Ozone 
Manager (OM) to improve write throughput and scalability by distributing bucket 
write requests across multiple independent RAFT groups, eliminating the 
single-leader bottleneck in the current architecture.
+
+## Background
+
+### Current Architecture Limitations
+
+Apache Ozone currently uses a single RAFT consensus group for the Ozone 
Manager (OM) in high availability (HA) deployments. While this provides strong 
consistency and automatic failover, it has several limitations:
+
+1. **Single Leader Bottleneck**: All write operations must go through a single 
OM leader, limiting write throughput regardless of the number of OM replicas
+2. **RAFT Log Contention**: A single RAFT log serializes all metadata updates, 
creating a scalability bottleneck
+3. **Resource Underutilization**: In a 3-node OM cluster, only one node 
actively processes write requests
+4. **Limited Horizontal Scalability**: Adding more OM nodes improves read 
capacity (with follower reads) but not write capacity
+
+### Scalability Requirements
+
+As Ozone deployments grow to support:
+- Thousands of buckets across multiple volumes
+- Millions of concurrent client operations
+- Petabytes of data with billions of objects
+
+The current single-raft architecture becomes a significant bottleneck for 
metadata operations.
+
+## Goal
+**Improve Write Throughput**: Distribute write load across multiple RAFT 
leaders to achieve near-linear scaling with the number of OM nodes
+
+## Architecture
+
+### High-Level Design
+
+The multi-raft architecture partitions bucket write requests across a 
configurable number of RAFT groups (default: 6). Each RAFT group:
+- Has its own RAFT leader, followers, and log
+- Processes write requests independently and in parallel
+- Uses the same OM nodes but with different leaders

Review Comment:
   In the MVP, installing a snapshot for one state machine could corrupt the 
trxId<->raftLogIndex state of the other Raft groups. I suggest a per-node (not 
per-group) sync of the state machines (pause all state machines, sync, unpause) 
as the initial step (Phase 1), and separating the per-group DB or column-family 
state as Phase 2. A new commit with this suggestion is ready.
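
   To make the Phase 1 idea concrete, here is a minimal sketch of the 
pause-all / sync / unpause sequence. It is only an illustration under stated 
assumptions: `PausableStateMachine` and `NodeLevelSync` are hypothetical names 
invented for this example, not existing Ozone or Ratis classes, and the sync 
step is passed in as a plain `Runnable` rather than a real snapshot 
installation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one Raft group's state machine on this OM node.
interface PausableStateMachine {
    void pause();
    void unpause();
}

// Hypothetical per-node coordinator: no group applies transactions while
// the shared state is being synced.
final class NodeLevelSync {
    private final List<PausableStateMachine> stateMachines;

    NodeLevelSync(List<PausableStateMachine> stateMachines) {
        this.stateMachines = List.copyOf(stateMachines);
    }

    // Pauses every state machine, runs the sync action once for the whole
    // node, then unpauses in reverse order. Synchronized so that two syncs
    // on the same node cannot interleave.
    synchronized void syncAll(Runnable syncAction) {
        List<PausableStateMachine> paused = new ArrayList<>();
        try {
            for (PausableStateMachine sm : stateMachines) {
                sm.pause();
                paused.add(sm);
            }
            syncAction.run();
        } finally {
            // Unpause only what was actually paused, even if pause() or
            // the sync action threw.
            for (int i = paused.size() - 1; i >= 0; i--) {
                paused.get(i).unpause();
            }
        }
    }
}
```

   The point of the sketch is that the sync runs once per node while every 
group is paused, so no group can advance its trxId<->raftLogIndex mapping 
against a DB that is being replaced underneath it.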



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


