redwood9 opened a new issue, #3380:
URL: https://github.com/apache/kvrocks/issues/3380

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/kvrocks/issues) and found no similar issues.
   
   
   ### Motivation
   
   # Problem Statement
   The current kvrocks + controller deployment carries a split-brain risk under 
network partitions. Because there is no lease mechanism, the old kvrocks master 
may continue accepting writes indefinitely after it has been replaced.
   
   # Design Principles
   The following principles are derived from the analysis, balancing 
implementation cost against benefit.
   
   1. The kvrocks master gates write operations on a lease; without a valid 
lease, writes are rejected (configurable). The controller probe must carry the 
current master ID, the cluster version, and the lease duration. When kvrocks 
receives a probe whose master ID and version match its own, it extends the 
lease.
   2. Shorten the controller session TTL to detect network partition events 
more quickly.
   3. Allow manual designation of the active partition; the controller provides 
a one-command switch to move the kvrocks master to the specified partition, for 
use when a partition failure is already known.
   4. kvrocks master validates its master status via an API call on each write 
(configurable); the controller exposes an API that returns who the current 
master is.
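
   Principle 1 could be sketched roughly as follows. This is an illustrative 
sketch only — the type, method, and field names (`masterLease`, `OnProbe`, 
`CheckWrite`, `required`) are assumptions, not the actual kvrocks API:

   ```go
   package main

   import (
   	"errors"
   	"fmt"
   	"sync"
   	"time"
   )

   // masterLease sketches principle 1: the kvrocks master caches a lease that
   // controller probes refresh, and rejects writes once the lease expires.
   type masterLease struct {
   	mu             sync.Mutex
   	masterID       string
   	clusterVersion int64
   	expiresAt      time.Time
   	required       bool // the proposed "configurable" switch
   }

   // OnProbe extends the lease only when the probe's master ID and cluster
   // version match the node's own view, as the proposal requires.
   func (l *masterLease) OnProbe(masterID string, version int64, ttl time.Duration, now time.Time) bool {
   	l.mu.Lock()
   	defer l.mu.Unlock()
   	if masterID != l.masterID || version != l.clusterVersion {
   		return false
   	}
   	l.expiresAt = now.Add(ttl)
   	return true
   }

   var errLeaseExpired = errors.New("ERR lease expired: writes rejected")

   // CheckWrite gates a write operation on lease validity.
   func (l *masterLease) CheckWrite(now time.Time) error {
   	l.mu.Lock()
   	defer l.mu.Unlock()
   	if l.required && now.After(l.expiresAt) {
   		return errLeaseExpired
   	}
   	return nil
   }

   func main() {
   	now := time.Now()
   	l := &masterLease{masterID: "node-a", clusterVersion: 7, required: true}
   	l.OnProbe("node-a", 7, 4*time.Second, now)
   	fmt.Println(l.CheckWrite(now.Add(2 * time.Second))) // within the lease
   	fmt.Println(l.CheckWrite(now.Add(6 * time.Second))) // after expiry
   }
   ```

   Note that a probe with a stale master ID or cluster version deliberately 
does not extend the lease, so a demoted master stops accepting writes once its 
last valid lease runs out.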
   
   # Rationale
   ## The Core Problem: Old Master Keeps Writing During Network Partition
   Consider the following sequence of events:
   1. T=0: A network partition occurs. The controller loses visibility of the 
master M; M is unaware of the partition and continues accepting writes.
   2. T=3~12s: Probe fails 4 consecutive times. Only the failure counter 
increments; no action is taken.
   3. T=15s: 5th failure. The controller elects the replica S as the new 
master, writes the new topology to etcd, and pushes SETNODES to S. S becomes 
master.
   4. M never receives SETNODES. It still believes itself to be master and 
continues accepting writes. Two masters coexist; data diverges.
   5. Network recovers. Controller detects M has a stale version and pushes the 
new topology. M is demoted to slave and syncs from S. All writes made during 
the partition are lost.
   
   The faster the master detects the anomaly and stops writing, the less the 
data diverges. The cost tradeoffs are:
   1. Per-write: master checks with the controller on every write. Most 
effective, but adds network round-trip cost, increases etcd concurrency, and 
raises write latency.
   2. Per-write: master checks a local lease cache; rejects writes if the cache 
is expired. The controller probe refreshes the cache. Better cost/benefit 
balance, but a brief split-brain window exists if a partition occurs during a 
valid cache period and a new master is assigned.
   3. Options 1 and 2 are complementary. Use option 1 for high-consistency 
requirements, option 2 otherwise. This can be controlled by configuration.
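
   The combined write path of options 1 and 2 might look like the following 
sketch. `validateWrite`, `checkMode`, and the injected `queryController` 
callback are all hypothetical names standing in for the real controller RPC:

   ```go
   package main

   import (
   	"fmt"
   	"time"
   )

   // checkMode mirrors the proposed configuration: per-write remote
   // validation (option 1) versus a local lease cache (option 2).
   type checkMode int

   const (
   	checkRemote checkMode = iota // option 1: ask the controller on every write
   	checkLease                   // option 2: consult the locally cached lease
   )

   // validateWrite sketches the write path. queryController stands in for
   // the real controller round-trip and is an assumption of this sketch.
   func validateWrite(mode checkMode, selfID string, leaseExpiry time.Time,
   	now time.Time, queryController func() (string, error)) error {
   	switch mode {
   	case checkRemote:
   		master, err := queryController()
   		if err != nil {
   			return err
   		}
   		if master != selfID {
   			return fmt.Errorf("no longer master (controller reports %s)", master)
   		}
   		return nil
   	default: // checkLease
   		if now.After(leaseExpiry) {
   			return fmt.Errorf("lease expired; rejecting write")
   		}
   		return nil
   	}
   }

   func main() {
   	now := time.Now()
   	// Option 2: the local lease cache is still valid, so the write passes.
   	fmt.Println(validateWrite(checkLease, "node-a", now.Add(3*time.Second), now, nil))
   	// Option 1: the controller reports a different master, so the write fails.
   	fmt.Println(validateWrite(checkRemote, "node-a", time.Time{}, now,
   		func() (string, error) { return "node-b", nil }))
   }
   ```

   The mode switch makes the tradeoff explicit: option 1 pays a round-trip per 
write for the smallest split-brain window, while option 2 costs nothing on the 
write path but leaves the brief window described above.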
   
   ## Impact of Single-Node Controller Probe
   Because the controller master is the sole probe source for the kvrocks 
master, two failure modes exist under network partition:
   1. The controller master can probe the kvrocks master successfully, but both 
are in the failed partition and cannot serve clients.
   2. The controller is in the failed partition and cannot reach the kvrocks 
master, but the kvrocks master is in the healthy partition — triggering an 
unnecessary failover.
   
   Some industry approaches use probes from clients in multiple partitions 
simultaneously, combining results to decide whether to trigger a failover. This 
requires an additional quorum mechanism.
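
   The multi-vantage-point idea reduces to a majority vote over independent 
probe results. A minimal sketch of that decision rule (the function name and 
the majority threshold are assumptions; real systems also need to deduplicate 
and time-bound the votes):

   ```go
   package main

   import "fmt"

   // shouldFailover returns true only when a strict majority of independent
   // probe sources report the master as unreachable, so a single partitioned
   // prober cannot trigger an unnecessary failover on its own.
   func shouldFailover(probeFailed []bool) bool {
   	failed := 0
   	for _, f := range probeFailed {
   		if f {
   			failed++
   		}
   	}
   	return failed*2 > len(probeFailed)
   }

   func main() {
   	// Two of three vantage points see a failure: fail over.
   	fmt.Println(shouldFailover([]bool{true, true, false}))
   	// Only one of three does: likely the prober is partitioned, not the master.
   	fmt.Println(shouldFailover([]bool{true, false, false}))
   }
   ```

   This is exactly the extra quorum mechanism the text refers to, which is why 
the proposal instead settles for keeping the single probe source (the 
controller master) inside the reachable partition via etcd.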
   
   Given current constraints, using etcd's own mechanism to keep the controller 
master in the reachable partition is a pragmatic tradeoff:
   1. etcd `minLeaseTTL` = 2s; the minimum achievable lease is 2 seconds.
   2. `controller sessionTTL` = 6s, currently hardcoded as a constant. This 
can be made configurable; halving it to 3s is a reasonable starting point.
   3. Controller-to-kvrocks failover window = `ping_interval × max_ping_count` 
= 3s × 5 = 15s. Consider reducing this, e.g. to 10s.
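
   The failover-window arithmetic from point 3 is simple enough to state as 
code (`failoverWindow` is an illustrative name; reaching the suggested 10s by 
shortening `ping_interval` to 2s is one possible assumption, not a decided 
parameter choice):

   ```go
   package main

   import (
   	"fmt"
   	"time"
   )

   // failoverWindow computes the controller-to-kvrocks failover window
   // described above: ping_interval × max_ping_count.
   func failoverWindow(pingInterval time.Duration, maxPingCount int) time.Duration {
   	return pingInterval * time.Duration(maxPingCount)
   }

   func main() {
   	fmt.Println(failoverWindow(3*time.Second, 5)) // current: 3s × 5 = 15s
   	fmt.Println(failoverWindow(2*time.Second, 5)) // one way to reach the suggested 10s
   }
   ```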
   
   Therefore:
   1. For high-consistency requirements: kvrocks must actively validate its 
master status via network on every write.
   2. Otherwise: the maximum split-brain window can be reduced to 12s 
(currently 21s). A split-brain can still occur within that 12s window — this is 
an explicit tradeoff.
   
   ## Manual Intervention During Partition Failure
   Consider a concrete network partition example: in a multi-region deployment, 
an entire region goes offline. Providing a manual intervention mechanism could 
make the recovery process more controlled — for example, a command to switch to 
a specified region. Once triggered, the system prioritizes instances in the 
designated region during the next master election.
   
   If this approach is considered viable, I can implement it incrementally — 
starting with the lease mechanism for kvrocks master write operations, made 
configurable.
   
   ### Solution
   
   1. The kvrocks master gates write operations on a lease; without a valid 
lease, writes are rejected (configurable). The controller probe must carry the 
current master ID, the cluster version, and the lease duration. When kvrocks 
receives a probe whose master ID and version match its own, it extends the 
lease.
   2. Shorten the controller session TTL to detect network partition events 
more quickly.
   3. Allow manual designation of the active partition; the controller provides 
a one-command switch to move the kvrocks master to the specified partition, for 
use when a partition failure is already known.
   4. kvrocks master validates its master status via an API call on each write 
(configurable); the controller exposes an API that returns who the current 
master is.
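
   On the controller side, point 4 only needs a small read-only endpoint. A 
minimal sketch, assuming a query-parameter lookup and a `{"master": ...}` JSON 
shape — neither of which is the real kvrocks-controller API:

   ```go
   package main

   import (
   	"encoding/json"
   	"fmt"
   	"io"
   	"net/http"
   	"net/http/httptest"
   )

   // masterHandler sketches the proposed controller endpoint that reports
   // the current master of a shard, so a kvrocks master running in strict
   // mode can validate itself on each write.
   func masterHandler(lookup func(shard string) string) http.HandlerFunc {
   	return func(w http.ResponseWriter, r *http.Request) {
   		w.Header().Set("Content-Type", "application/json")
   		json.NewEncoder(w).Encode(map[string]string{
   			"master": lookup(r.URL.Query().Get("shard")),
   		})
   	}
   }

   func main() {
   	// Stand in for the controller's etcd-backed topology lookup.
   	srv := httptest.NewServer(masterHandler(func(shard string) string { return "node-s" }))
   	defer srv.Close()

   	res, err := http.Get(srv.URL + "?shard=0")
   	if err != nil {
   		panic(err)
   	}
   	defer res.Body.Close()
   	body, _ := io.ReadAll(res.Body)
   	fmt.Print(string(body))
   }
   ```

   Because this endpoint is read-only, the extra etcd concurrency mentioned in 
the rationale comes only from how the controller implements the lookup, not 
from the endpoint itself.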
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]