[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058852#comment-18058852
]
Xinyu Tan commented on RATIS-2403:
----------------------------------
[~ivanandika]
Sorry for the delayed reply — we’ve just been celebrating the Chinese New Year
here in China. Thank you again for your patience and for the thoughtful
discussion. I’d like to share a few points:
----
*1. Regarding follower read usage in IoTDB*
In IoTDB, we do not use follower read. Instead, we rely on linearizable read
and lease read.
The main reason is that our cluster size is typically on the order of dozens of
nodes. For control-plane nodes, although we usually have only a single Raft
group, we have not yet encountered bottlenecks at the Raft layer. For
data-plane nodes, we use a multi-Raft architecture to distribute load. As long
as there are no significant hotspots, read and write traffic can be evenly
distributed across leaders of different Raft groups, and therefore across all
nodes.
In some scenarios, we observed that enabling follower read requires maintaining
certain state machine cache states consistently across multiple replicas, which
introduces additional overhead. Considering all of this, we chose to use
multi-Raft to balance load at the data-plane level, and within each Raft group,
we keep both reads and writes on the leader to maximize resource efficiency.
----
*2. On the Ozone scenario*
In your Ozone scenario, if the system has not yet reached any physical
bottleneck (CPU, disk, network, etc.), why would enabling follower read
actually reduce query throughput?
Is it because write throughput becomes higher after enabling follower read? If
so, that would suggest we need to introduce some form of write throttling to
rebalance resource usage. If not, then it may indicate other bottlenecks, and
we probably need to profile the system more carefully to identify the root
cause.
----
*3. On leader throttling approaches*
I reviewed your {{leader-backpressure.patch}}. Personally, I don’t think
defining a maximum gap between the leader and the slowest follower is a
reasonable solution.
In a production environment, even if we configure a practical threshold, once a
follower has been down for a sufficiently long time, the gap will eventually
exceed the limit, which could block the entire Raft group from making further
progress on writes. This violates the liveness guarantees of Raft.
Moreover, in a three-replica Raft group, even if one follower lags
significantly behind, it can theoretically catch up later via snapshot
installation. Therefore, tying write availability to the slowest follower’s gap
does not seem like a robust throttling strategy.
Instead, I would prefer limiting the rate at which the leader advances the
{{commitIndex}}. For example, we could detect the advancement rate of
{{commitIndex}} inside {{tryAcquirePendingRequest}}, and if it is
progressing too quickly, temporarily block or slow down new writes.
Although this approach may require case-by-case tuning (e.g., how many write
ops per second to allow), it provides a clear upper bound on the resources
consumed by writes, thereby reserving sufficient headroom for read queries.
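To make the rate-limiting idea above concrete, here is a minimal token-bucket sketch in Java. Everything here is illustrative: the class name {{WriteRateLimiter}}, the {{maxWritesPerSecond}} parameter, and the idea of calling it from an admission path such as {{tryAcquirePendingRequest}} are assumptions for discussion, not existing Ratis APIs.

```java
// Hypothetical sketch of a token-bucket limiter that caps how fast new
// writes are admitted, and therefore how fast commitIndex can advance.
// Not a Ratis API; names and placement are illustrative only.
final class WriteRateLimiter {
  private final long maxWritesPerSecond;
  private double tokens;          // currently available write permits
  private long lastRefillNanos;   // timestamp of the last refill

  WriteRateLimiter(long maxWritesPerSecond) {
    this.maxWritesPerSecond = maxWritesPerSecond;
    this.tokens = maxWritesPerSecond;   // start with a full bucket
    this.lastRefillNanos = System.nanoTime();
  }

  /**
   * Returns true if one more write may be admitted now; false if the
   * caller should block or retry later. Refills tokens proportionally
   * to the elapsed time, capped at one second's worth of permits.
   */
  synchronized boolean tryAcquire() {
    final long now = System.nanoTime();
    final double refill = (now - lastRefillNanos) / 1e9 * maxWritesPerSecond;
    tokens = Math.min(maxWritesPerSecond, tokens + refill);
    lastRefillNanos = now;
    if (tokens >= 1.0) {
      tokens -= 1.0;
      return true;
    }
    return false;
  }
}
```

A server-side admission check like this gives a hard per-second ceiling on write throughput regardless of follower state, so a long-dead follower cannot stall the group; the trade-off, as noted above, is that the ceiling itself must be tuned per workload.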
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better write
> throughput becomes, we saw around 2-3x write throughput increase compared to
> the leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worst than leader-only
> write and read (some can be below 0.2x). Even with optimizations such as
> RATIS-2392 RATIS-2382 [https://github.com/apache/ratis/pull/1334] RATIS-2379,
> the read throughput remains worse than leader-only write (it even improves
> the write performance instead of the read performance).
> I suspect that because write throughput increases, the read index increases
> at a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput by 1.5x - 2x of the leader-only
> write and reads. Currently a pure-read workload (no writes) improves read
> throughput by up to 1.7x, but total follower read throughput is way below
> this target.
> Currently my ideas are
> * Sacrificing writes for reads: Can we limit the write QPS so that read QPS
> can increase
> ** From the benchmark, the read throughput only improves when write
> throughput is lower
> ** We can try to use a backpressure mechanism so that writes do not advance
> so quickly that read throughput suffers
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader
> to stall if a follower is down for a while (e.g. restarted), which violates
> the majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)