Andrew Kyle Purtell created HBASE-30139:
-------------------------------------------
Summary: RAFT-Based Promotable Region Replicas
Key: HBASE-30139
URL: https://issues.apache.org/jira/browse/HBASE-30139
Project: HBase
Issue Type: New Feature
Components: meta replicas, read replicas
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
HBase supports configuring a table with multiple region replicas. When a table
has replicas, each region exists as a primary copy and one or more read-only
copies hosted on different RegionServers. The primary handles all client writes
and serves the default read path. The read-only replicas are opened on other
RegionServers, share the primary's HFiles on HDFS, and receive memstore
updates through an asynchronous replication pipeline. Clients may read from
replicas using timeline-consistent reads. Replicas cannot accept writes and
cannot be promoted to primary. This model improves read availability for
stale-data-tolerant workloads, but it does nothing for write availability or fast
failover. When the primary's RegionServer dies, the region becomes unavailable
for writes. Read-only replicas can still serve timeline-consistent reads, but
with increasingly stale data. Replicas can be arbitrarily far behind the primary,
so even their stale-read utility degrades under replication lag. There is no
protocol to determine which replica is most current or to coordinate a handoff.
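For concreteness, here is how replicas and timeline reads work today through
the existing client API. This is a minimal sketch: the table and column family
names are illustrative, and an open {{Connection}} ({{conn}}) is assumed.

{code:java}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

void timelineReadDemo(Connection conn) throws Exception {
  TableName tn = TableName.valueOf("t1");
  try (Admin admin = conn.getAdmin()) {
    // One primary plus two read-only replicas per region.
    admin.createTable(TableDescriptorBuilder.newBuilder(tn)
        .setRegionReplication(3)
        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f"))
        .build());
  }
  try (Table table = conn.getTable(tn)) {
    // TIMELINE consistency lets the client read from a replica; the result
    // is flagged stale when a replica, not the primary, served it.
    Get get = new Get(Bytes.toBytes("row1"));
    get.setConsistency(Consistency.TIMELINE);
    Result result = table.get(get);
    boolean possiblyStale = result.isStale();
  }
}
{code}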
This design replaces the asynchronous WAL replication pipeline with RAFT
consensus groups at the region level. Each set of replicas for a region forms a
RAFT group. The primary region acts as the RAFT leader, and the read-only
replica regions act as RAFT followers. The leader replicates edits
synchronously through RAFT to keep follower memstores warm and consistent,
replacing the best-effort async pipeline with an ordered, majority-committed
log.
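As a sketch of what the leader-side write path might look like under this
design. All of the types here ({{RaftGroup}}, {{NotLeaderException}}) are
hypothetical and do not exist in HBase today, and {{applyToMemstore}} stands
in for the region's normal apply path.

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.wal.WALEdit;

// Hypothetical type: thrown when this node has lost RAFT leadership.
class NotLeaderException extends IOException {}

// Hypothetical interface standing in for the per-region RAFT group.
interface RaftGroup {
  // Append edits to the group's log and block until a majority of replicas
  // have durably accepted them, returning the commit index.
  long replicate(List<WALEdit> edits) throws IOException;
}

// Leader-side write path: every committed edit has already reached a
// majority of follower logs, in order, before the client is acked, so
// follower memstores stay warm and consistent with the primary.
long leaderWrite(RaftGroup group, List<WALEdit> edits) throws IOException {
  long commitIndex = group.replicate(edits); // majority-committed, ordered
  applyToMemstore(edits); // followers apply the same entries in log order
  return commitIndex;     // safe to ack the client now
}
{code}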
The key improvement is {*}promotability{*}. When the primary fails, the
surviving followers already hold a warm, consistent memstore. They elect a new
RAFT leader among themselves, and the elected leader reports the election
result to the master. The master's AssignmentManager remains the sole arbiter
of which region is primary. It validates the RAFT election term, updates META
to record the new primary location, and returns confirmation to the
RegionServer. Only after receiving this confirmation does the promoted replica
complete its local state transitions and begin serving writes. However, this
happens very quickly compared to today's WAL-mediated recovery path: there
is no WAL splitting and no recovered-edits replay. Failover completes in
sub-second to low single-digit seconds.
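A sketch of that promotion handshake, again with hypothetical names
({{MasterPromotionService}}, {{PromoteResponse}}, {{transitionToPrimary}},
{{master}}); see the design document linked below for the real protocol.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.RegionInfo;

// Hypothetical RPC surface between RegionServer and master.
interface MasterPromotionService {
  PromoteResponse reportLeaderElection(RegionInfo region, long raftTerm)
      throws IOException;
}

// On the RegionServer hosting the replica that just won the RAFT election:
void onElectedLeader(RegionInfo region, long raftTerm) throws IOException {
  // The AssignmentManager remains the sole arbiter of primaryship. It
  // validates the reported RAFT term, updates META with the new primary
  // location, and returns confirmation.
  PromoteResponse resp = master.reportLeaderElection(region, raftTerm);
  if (!resp.isConfirmed()) {
    return; // stale term or a competing election won; remain a follower
  }
  // Only after confirmation does the replica complete its local state
  // transition and begin serving writes. The memstore is already warm,
  // so there is no WAL splitting and no recovered-edits replay.
  transitionToPrimary(region);
}
{code}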
Some of you may remember Facebook's ancient "HydraBase". This is NOT HydraBase
redux and does not repeat its design errors.
Design document:
[https://github.com/apurtell/hbase/blob/WORK-raft-replicas/RAFT_REGION_REPLICAS.md]
{{hbase-consensus}} proof-of-concept:
[https://github.com/apurtell/hbase/blob/WORK-raft-replicas/hbase-consensus/]
Currently this is at the "science project" stage. When that changes I will
update this part of the summary with strikethrough.