vtutrinov commented on code in PR #10103:
URL: https://github.com/apache/ozone/pull/10103#discussion_r3458233819
##########
hadoop-hdds/docs/content/design/om-multiraft.md:
##########
@@ -0,0 +1,517 @@
+# Ozone Multi-Raft Design Document
+
+## Abstract
+
+This document proposes a multi-raft architecture for Apache Ozone's Ozone Manager (OM) to improve write throughput and scalability by distributing bucket write requests across multiple independent RAFT groups, eliminating the single-leader bottleneck in the current architecture.
+
+## Background
+
+### Current Architecture Limitations
+
+Apache Ozone currently uses a single RAFT consensus group for the Ozone Manager (OM) in high availability (HA) deployments. While this provides strong consistency and automatic failover, it has several limitations:
+
+1. **Single Leader Bottleneck**: All write operations must go through a single OM leader, limiting write throughput regardless of the number of OM replicas
+2. **RAFT Log Contention**: A single RAFT log serializes all metadata updates, creating a scalability bottleneck
+3. **Resource Underutilization**: In a 3-node OM cluster, only one node actively processes write requests
+4. **Limited Horizontal Scalability**: Adding more OM nodes improves read capacity (with follower reads) but not write capacity
+
+### Scalability Requirements
+
+As Ozone deployments grow to support:
+- Thousands of buckets across multiple volumes
+- Millions of concurrent client operations
+- Petabytes of data with billions of objects
+
+The current single-raft architecture becomes a significant bottleneck for metadata operations.
+
+## Goal
+**Improve Write Throughput**: Distribute write load across multiple RAFT leaders to achieve near-linear scaling with the number of OM nodes
+
+## Architecture
+
+### High-Level Design
+
+The multi-raft architecture partitions buckets write request across a configurable number of RAFT groups (default: 6). Each RAFT group:
+- Has its own RAFT leader, followers, and log
+- Processes write requests independently and in parallel
+- Uses the same OM nodes but with different leaders

Review Comment:
   For now (in MVP), a snapshot installation for a specific state machine could corrupt the trxId<->raftLogIndex state of other raft groups.
   Suggest per-node (not per-group) sync of the state machines (pause all state machines, sync, unpause) as an initial step (Phase 1) and separation of per-group DB or column families state as a Phase 2.
   A new commit with the suggestion above is ready.
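
   To make the Phase 1 suggestion concrete, here is a minimal sketch of a per-node "pause all, sync, unpause" sequence. It is an illustration under stated assumptions, not the code in the commit: `PausableStateMachine`, `NodeSyncCoordinator`, and `syncAll` are hypothetical names, and the `Runnable` stands in for whatever snapshot installation or DB sync the node actually performs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one RAFT group's state machine; not an Ozone or Ratis type. */
interface PausableStateMachine {
  void pause();
  void unpause();
}

/** Hypothetical per-node coordinator: all groups on the node are synced together. */
final class NodeSyncCoordinator {
  private final List<PausableStateMachine> groups;

  NodeSyncCoordinator(List<PausableStateMachine> groups) {
    this.groups = new ArrayList<>(groups);
  }

  /**
   * Pauses every group's state machine on this node, runs the sync step once,
   * then unpauses (in reverse order) only the state machines that were paused.
   * The method is synchronized so two syncs on the same node cannot interleave.
   */
  synchronized void syncAll(Runnable sync) {
    List<PausableStateMachine> paused = new ArrayList<>();
    try {
      for (PausableStateMachine sm : groups) {
        sm.pause();
        paused.add(sm);
      }
      // Assumption: the shared state is only touched here, while no group applies transactions.
      sync.run();
    } finally {
      for (int i = paused.size() - 1; i >= 0; i--) {
        paused.get(i).unpause();
      }
    }
  }
}
```

   The `finally` block is the point of the sketch: if the sync step throws, the state machines that were already paused are still resumed, so one failed snapshot installation does not leave the other groups stopped.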
