[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060285#comment-18060285
]
Ivan Andika commented on RATIS-2403:
------------------------------------
Thank you [~tanxinyu] and apologies for the late reply. Happy CNY, I was also
celebrating CNY last week.
{quote}*1. Regarding follower read usage in IoTDB*
{quote}
Thanks for the insight regarding the usage of linearizable read in IoTDB. The
control plane and data plane setup is similar to Ozone's OM and Datanodes,
where the control plane contains a single Raft group and the datanodes use a
multi-Raft setup to distribute load. Currently, we are encountering a
bottleneck on the control nodes: high RPC queue times on the OM leader, mostly
due to lock contention (e.g. multiple key operations need to acquire the same
bucket lock), which blocks caller threads. We intend to enable follower read to
try to improve the read throughput of OM, but it has the unintended effect of
reducing read throughput while increasing write throughput. For the data
plane, similar to IoTDB, we also use multi-Raft (a single Raft group is called
a "Ratis Pipeline") to balance load at the data-plane level. Unlike IoTDB,
however, reads on the data plane do not go through Ratis. Instead we use an
internal sequence ID / version (the BCSID, Block Commit Sequence ID) which
corresponds to the Raft log index: when a user wants to read a key, the client
checks the BCSID stored in the meta nodes and compares it with the one in the
datanode replica; if they mismatch, the datanode replica is stale and the
client picks another node. The Raft groups in Ozone pipelines are also
relatively short-lived: they can be destroyed and recreated in response to
failures, migration, etc.
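To make the staleness check above concrete, here is a minimal sketch (hypothetical names, not the actual Ozone API) of picking a replica whose committed BCSID has caught up with the BCSID recorded in the meta nodes:

```java
import java.util.Map;

// Illustrative only: Ozone's real client/replica protocol is more involved.
public final class BcsidCheck {
  /**
   * Return the first replica whose committed BCSID has reached the BCSID
   * recorded in the metadata service for this block, or null if every
   * replica is stale (caller then retries or fails the read).
   */
  public static String pickReplica(long bcsidFromMeta, Map<String, Long> replicaBcsids) {
    for (Map.Entry<String, Long> e : replicaBcsids.entrySet()) {
      if (e.getValue() >= bcsidFromMeta) {
        return e.getKey(); // replica has applied the write; safe to read
      }
    }
    return null; // all replicas behind the metadata BCSID
  }
}
```

The point is that staleness is detected by comparing sequence numbers at read time, so no Ratis read quorum (and no ReadIndex wait) is needed on the data-plane read path.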
{quote}*2. On the Ozone scenario*
{quote}
> Is it because write throughput becomes higher after enabling follower read?
Yes, this is the current behavior I observed.
{quote}*3. On leader throttling approaches*
{quote}
Yes, the maximum-gap solution has a lot of issues (as you mentioned) and only
serves as a way to prove my hypothesis (that there is an inverse relationship
between write throughput and read throughput). I have attached the approach
suggested by [~szetszwo] [^leader-batch-write.patch]. Although it performs
better than the leader throttling approaches, it is also inflexible, since we
don't know how frequently we need to batch writes to get the desired read and
write performance, and the interval cannot adjust dynamically based on the
current workload (read-heavy or write-heavy). In my tests, a 10ms interval
improved read throughput by 1.3-1.7x, but write throughput could degrade to
0.7x (one case degraded to 0.3x), although there was one case where both read
and write throughput improved by 1.x to 1.4x. When I decreased the interval to
5ms, the read QPS degraded again for write-heavy cases (although there was one
case where both read and write throughput improved by 1.2x-1.3x). Therefore,
these solutions require periodic and careful tuning, which is not ideal, and
they might therefore not be good candidates to push to Ratis upstream.
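For illustration, the fixed-interval batching idea can be sketched as follows. This is a hypothetical simplification, not the attached leader-batch-write.patch; the caller-supplied clock is only there to make the sketch deterministic:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: writes are queued and only flushed once per interval, leaving the
// gaps between flushes for reads to be served before the read index advances.
final class IntervalBatcher<T> {
  private final long intervalMs;
  private long lastFlushMs;
  private final List<T> pending = new ArrayList<>();

  IntervalBatcher(long intervalMs, long nowMs) {
    this.intervalMs = intervalMs;
    this.lastFlushMs = nowMs;
  }

  /** Queue a write; returns the batch to submit if the interval elapsed, else null. */
  List<T> offer(T write, long nowMs) {
    pending.add(write);
    if (nowMs - lastFlushMs >= intervalMs) {
      List<T> batch = new ArrayList<>(pending);
      pending.clear();
      lastFlushMs = nowMs;
      return batch;
    }
    return null;
  }
}
```

The inflexibility mentioned above shows up directly here: `intervalMs` is a fixed constructor argument, and nothing adjusts it when the workload shifts between read-heavy and write-heavy.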
I had some further thoughts regarding this issue and identified problems in my
benchmark setup. I used 100 clients in both the leader-read setup (baseline)
and the follower-read setup (clients can read from both the leader and the
followers). The realization is that 100 clients doing follower reads will
never beat 100 clients doing leader reads unless the leader is already
saturated (high resource usage, or RPC queue latency higher than the ReadIndex
wait), since the leader responds almost immediately while a follower has to
wait (even 1-2ms is a lot higher than an immediate return). A better setup for
follower reads might run two concurrent workloads: 1) 100 clients doing leader
reads, and 2) 200 clients doing follower reads, with 100 clients assigned to
each follower (assuming 2 followers and 1 leader) so that every node serves
100 clients. If we can verify that the leader does not degrade when the
additional 200 clients are added exclusively to the followers, it means
follower read works for adding new read throughput. That said, this also means
we cannot simply enable follower reads on an existing workload, since the
existing workload's read throughput might degrade. I will benchmark and
validate this.
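A rough way to see why the follower wait grows with write throughput (an illustrative back-of-the-envelope model, not Ratis code): the wait is approximately the gap between the leader's read index and the follower's applied index, divided by the follower's apply rate, and a higher write rate widens that gap.

```java
// Hypothetical model: estimate how long a linearizable follower read blocks
// waiting for the follower's applied index to reach the leader's read index.
final class FollowerReadWait {
  static long estimateWaitMillis(long readIndex, long appliedIndex, double applyRatePerMs) {
    long gap = Math.max(0, readIndex - appliedIndex);
    return (long) Math.ceil(gap / applyRatePerMs); // 0 when already caught up
  }
}
```

A leader read has no such gap to close, which is why the baseline 100-client leader-read setup wins whenever the leader is unsaturated.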
Another problem is that the benchmark only uses a single client (user), which
should make the Ozone / Hadoop rate limiting ineffective. Ozone / Hadoop rate
limiting deprioritizes users (identified by username) to a lower-priority
queue if they send a lot of requests (weighted by factors such as how long a
request holds exclusive/write or shared/read locks), but if there is only one
user, this QoS mechanism will not kick in. Therefore, I'll change my benchmark
to use different users and see whether the rate limiter can balance the reads
and writes. This requires some reads to be sent to the leader, since the rate
limiter is not distributed, so this also has some possible issues.
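For illustration, the per-user deprioritization described above can be sketched as follows (hypothetical names, weights, and threshold; not Hadoop's actual FairCallQueue/DecayRpcScheduler implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: each user accumulates a weighted cost (exclusive/write-lock requests
// cost more than shared/read-lock ones); heavy users are demoted to a
// lower-priority queue. With a single user, demotion never differentiates
// anyone, which is why the single-client benchmark defeats this QoS.
final class UserCostScheduler {
  private static final long WRITE_COST = 10; // assumed weight, illustrative
  private static final long READ_COST = 1;

  private final Map<String, Long> costByUser = new HashMap<>();
  private final long demoteThreshold;

  UserCostScheduler(long demoteThreshold) {
    this.demoteThreshold = demoteThreshold;
  }

  void record(String user, boolean isWrite) {
    costByUser.merge(user, isWrite ? WRITE_COST : READ_COST, Long::sum);
  }

  /** 0 = normal priority queue, 1 = deprioritized queue. */
  int priority(String user) {
    return costByUser.getOrDefault(user, 0L) >= demoteThreshold ? 1 : 0;
  }
}
```

(The real scheduler also decays costs over time; that is omitted here for brevity.)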
Improving read throughput without sacrificing any linearizability is tricky.
As a next step, I will attempt RATIS-2376 to support a timestamp-based
staleness bound.
Let me know what you think. Thanks again for the discussion and the food for
thought.
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch, leader-batch-write.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better the write
> throughput becomes; we saw around a 2-3x write throughput increase compared
> to leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (some cases can be below 0.2x). Even with optimizations such as
> RATIS-2392 RATIS-2382 [https://github.com/apache/ratis/pull/1334] RATIS-2379,
> the read throughput remains worse than leader-only write (these mostly
> improve the write performance rather than the read performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only
> write and read. Currently, a pure-read (no writes) workload improves read
> throughput by up to 1.7x, but total follower read throughput is way below
> this target.
> Currently my ideas are
> * Sacrificing writes for reads: can we limit the write QPS so that the read
> QPS can increase?
> ** From the benchmark, the read throughput only improves when the write
> throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance so
> quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)