[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058852#comment-18058852
]
Xinyu Tan commented on RATIS-2403:
----------------------------------
[~ivanandika]
Sorry for the delayed reply — we’ve just been celebrating the Chinese New Year
here in China. Thank you again for your patience and for the thoughtful
discussion. I’d like to share a few points:
----
*1. Regarding follower read usage in IoTDB*
In IoTDB, we do not use follower read. Instead, we rely on linearizable read
and lease read.
The main reason is that our cluster size is typically on the order of dozens of
nodes. For control-plane nodes, although we usually have only a single Raft
group, we have not yet encountered bottlenecks at the Raft layer. For
data-plane nodes, we use a multi-Raft architecture to distribute load. As long
as there are no significant hotspots, read and write traffic can be evenly
distributed across leaders of different Raft groups, and therefore across all
nodes.
In some scenarios, we observed that enabling follower read requires maintaining
certain state machine cache states consistently across multiple replicas, which
introduces additional overhead. Considering all of this, we chose to use
multi-Raft to balance load at the data-plane level, and within each Raft group,
we keep both reads and writes on the leader to maximize resource efficiency.
----
*2. On the Ozone scenario*
In your Ozone scenario, if the system has not yet reached any physical
bottleneck (CPU, disk, network, etc.), why would enabling follower read
actually reduce query throughput?
Is it because write throughput becomes higher after enabling follower read? If
so, that would suggest we need to introduce some form of write throttling to
rebalance resource usage. If not, then it may indicate other bottlenecks, and
we probably need to profile the system more carefully to identify the root
cause.
----
*3. On leader throttling approaches*
I reviewed your {{leader-backpressure.patch}}. Personally, I don’t think
defining a maximum gap between the leader and the slowest follower is a
reasonable solution.
In a production environment, even if we configure a practical threshold, once a
follower has been down for a sufficiently long time, the gap will eventually
exceed the limit, which could block the entire Raft group from making further
progress on writes. This violates the liveness guarantees of Raft.
Moreover, in a three-replica Raft group, even if one follower lags
significantly behind, it can theoretically catch up later via snapshot
installation. Therefore, tying write availability to the slowest follower’s gap
does not seem like a robust throttling strategy.
Instead, I would prefer limiting the rate at which the leader advances the
{{commitIndex}}. For example, we could detect the advancement rate of
{{commitIndex}} inside {{tryAcquirePendingRequest}}, and if it is
progressing too quickly, temporarily block or slow down new writes.
Although this approach may require case-by-case tuning (e.g., how many write
ops per second to allow), it provides a clear upper bound on the resources
consumed by writes, thereby reserving sufficient headroom for read queries.
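To make the rate-limiting idea above concrete, here is a minimal token-bucket sketch in Java. Everything here is illustrative: the class name {{WriteRateLimiter}}, the {{maxWritesPerSecond}} parameter, and the idea of calling it from an admission path such as {{tryAcquirePendingRequest}} are assumptions for discussion, not existing Ratis APIs.

```java
// Hypothetical sketch of a token-bucket limiter that caps how fast new
// writes are admitted, and therefore how fast commitIndex can advance.
// Not a Ratis API; names and placement are illustrative only.
final class WriteRateLimiter {
  private final long maxWritesPerSecond;
  private double tokens;          // currently available write permits
  private long lastRefillNanos;   // timestamp of the last refill

  WriteRateLimiter(long maxWritesPerSecond) {
    this.maxWritesPerSecond = maxWritesPerSecond;
    this.tokens = maxWritesPerSecond;   // start with a full bucket
    this.lastRefillNanos = System.nanoTime();
  }

  /**
   * Returns true if one more write may be admitted now; false if the
   * caller should block or retry later. Refills tokens proportionally
   * to the elapsed time, capped at one second's worth of permits.
   */
  synchronized boolean tryAcquire() {
    final long now = System.nanoTime();
    final double refill = (now - lastRefillNanos) / 1e9 * maxWritesPerSecond;
    tokens = Math.min(maxWritesPerSecond, tokens + refill);
    lastRefillNanos = now;
    if (tokens >= 1.0) {
      tokens -= 1.0;
      return true;
    }
    return false;
  }
}
```

A server-side admission check like this gives a hard per-second ceiling on write throughput regardless of follower state, so a long-dead follower cannot stall the group; the trade-off, as noted above, is that the ceiling itself must be tuned per workload.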
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better write
> throughput becomes, we saw around 2-3x write throughput increase compared to
> the leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worst than leader-only
> write and read (some can be below 0.2x). Even with optimizations such as
> RATIS-2392 RATIS-2382 [https://github.com/apache/ratis/pull/1334] RATIS-2379,
> the read throughput remains worse than leader-only write (it even improves
> the write performance instead of the read performance).
> I suspect that because write throughput increases, the read index increases
> at a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput by 1.5x - 2x of the leader-only
> write and reads. Currently a pure-read workload (no writes) improves read
> throughput by up to 1.7x, but total follower read throughput is way below
> this target.
> Currently my ideas are
> * Sacrificing writes for reads: Can we limit the write QPS so that read QPS
> can increase
> ** From the benchmark, the read throughput only improves when write
> throughput is lower
> ** We can try to use a backpressure mechanism so that writes do not advance
> so quickly that read throughput suffers
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader
> to stall if a follower is down for a while (e.g. restarted), which violates
> the majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)