[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060285#comment-18060285
]
Ivan Andika commented on RATIS-2403:
------------------------------------
Thank you [~tanxinyu] and apologies for the late reply. Happy CNY, I was also
celebrating CNY last week.
{quote}*1. Regarding follower read usage in IoTDB*
{quote}
Thanks for the insight regarding the usage of linearizable read in IoTDB. The
control plane and data plane setup is similar to Ozone's OM and Datanodes,
where the control plane contains a single Raft group and the datanodes use a
multi-Raft setup to distribute load. Currently, we are encountering a
bottleneck on the control nodes: high RPC queue times on the OM leader, mostly
due to lock contention (e.g. multiple key operations need to acquire the same
bucket lock), which blocks caller threads. We intend to enable follower read to
try to improve the read throughput of OM, but it has the unintended effect of
reducing read throughput while increasing write throughput. For the data
plane, similar to IoTDB, we also use multi-Raft (a single Raft group is called
a "Ratis Pipeline") to balance load at the data-plane level. Unlike IoTDB,
however, reads on the data plane do not go through Ratis. Instead we use an
internal sequence ID / version (the BCSID, Block Commit Sequence ID) which
corresponds to the Raft log index: when a user wants to read a key, the client
checks the BCSID stored in the meta nodes and compares it with the one in the
datanode replica; if they mismatch, the datanode replica is stale and the
client picks another node. The Raft groups in Ozone pipelines are also
relatively short-lived: they can be destroyed and recreated in response to
failures, migration, etc.
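To make the staleness check above concrete, here is a minimal sketch (hypothetical names, not the actual Ozone API) of picking a replica whose committed BCSID has caught up with the BCSID recorded in the meta nodes:

```java
import java.util.Map;

// Illustrative only: Ozone's real client/replica protocol is more involved.
public final class BcsidCheck {
  /**
   * Return the first replica whose committed BCSID has reached the BCSID
   * recorded in the metadata service for this block, or null if every
   * replica is stale (caller then retries or fails the read).
   */
  public static String pickReplica(long bcsidFromMeta, Map<String, Long> replicaBcsids) {
    for (Map.Entry<String, Long> e : replicaBcsids.entrySet()) {
      if (e.getValue() >= bcsidFromMeta) {
        return e.getKey(); // replica has applied the write; safe to read
      }
    }
    return null; // all replicas behind the metadata BCSID
  }
}
```

The point is that staleness is detected by comparing sequence numbers at read time, so no Ratis read quorum (and no ReadIndex wait) is needed on the data-plane read path.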
{quote}*2. On the Ozone scenario*
{quote}
> Is it because write throughput becomes higher after enabling follower read?
Yes, this is the current behavior I observed.
{quote}*3. On leader throttling approaches*
{quote}
Yes, the maximum-gap solution has a lot of issues (as you mentioned) and only
serves as a way to prove my hypothesis (that there is an inverse relationship
between write throughput and read throughput). I have attached the approach
suggested by [~szetszwo] [^leader-batch-write.patch]. Although it performs
better than the leader throttling approaches, it is also inflexible, since we
don't know how frequently we need to batch writes to get the desired read and
write performance, and the interval cannot adjust dynamically based on the
current workload (read-heavy or write-heavy). In my tests, a 10ms interval
improved read throughput by 1.3-1.7x, but write throughput could degrade to
0.7x (one case degraded to 0.3x), although there was one case where both read
and write throughput improved by 1.x to 1.4x. When I decreased the interval to
5ms, the read QPS degraded again for write-heavy cases (although there was one
case where both read and write throughput improved by 1.2x-1.3x). Therefore,
these solutions require periodic and careful tuning, which is not ideal, and
they might therefore not be good candidates to push to Ratis upstream.
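For illustration, the fixed-interval batching idea can be sketched as follows. This is a hypothetical simplification, not the attached leader-batch-write.patch; the caller-supplied clock is only there to make the sketch deterministic:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: writes are queued and only flushed once per interval, leaving the
// gaps between flushes for reads to be served before the read index advances.
final class IntervalBatcher<T> {
  private final long intervalMs;
  private long lastFlushMs;
  private final List<T> pending = new ArrayList<>();

  IntervalBatcher(long intervalMs, long nowMs) {
    this.intervalMs = intervalMs;
    this.lastFlushMs = nowMs;
  }

  /** Queue a write; returns the batch to submit if the interval elapsed, else null. */
  List<T> offer(T write, long nowMs) {
    pending.add(write);
    if (nowMs - lastFlushMs >= intervalMs) {
      List<T> batch = new ArrayList<>(pending);
      pending.clear();
      lastFlushMs = nowMs;
      return batch;
    }
    return null;
  }
}
```

The inflexibility mentioned above shows up directly here: `intervalMs` is a fixed constructor argument, and nothing adjusts it when the workload shifts between read-heavy and write-heavy.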
I had some further thoughts regarding this issue and identified problems in my
benchmark setup. I used 100 clients in both the leader-read setup (baseline)
and the follower-read setup (clients can read from both the leader and the
followers). The realization is that 100 clients doing follower reads will
never beat 100 clients doing leader reads unless the leader is already
saturated (high resource usage, or RPC queue latency higher than the ReadIndex
wait), since the leader responds almost immediately while a follower has to
wait (even 1-2ms is a lot higher than an immediate return). A better setup for
follower reads might run two concurrent workloads: 1) 100 clients doing leader
reads, and 2) 200 clients doing follower reads, with 100 clients assigned to
each follower (assuming 2 followers and 1 leader) so that every node serves
100 clients. If we can verify that the leader does not degrade when the
additional 200 clients are added exclusively to the followers, it means
follower read works for adding new read throughput. That said, this also means
we cannot simply enable follower reads on an existing workload, since the
existing workload's read throughput might degrade. I will benchmark and
validate this.
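A rough way to see why the follower wait grows with write throughput (an illustrative back-of-the-envelope model, not Ratis code): the wait is approximately the gap between the leader's read index and the follower's applied index, divided by the follower's apply rate, and a higher write rate widens that gap.

```java
// Hypothetical model: estimate how long a linearizable follower read blocks
// waiting for the follower's applied index to reach the leader's read index.
final class FollowerReadWait {
  static long estimateWaitMillis(long readIndex, long appliedIndex, double applyRatePerMs) {
    long gap = Math.max(0, readIndex - appliedIndex);
    return (long) Math.ceil(gap / applyRatePerMs); // 0 when already caught up
  }
}
```

A leader read has no such gap to close, which is why the baseline 100-client leader-read setup wins whenever the leader is unsaturated.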
Another problem is that the benchmark only uses a single client (user), which
should make the Ozone / Hadoop rate limiting ineffective. Ozone / Hadoop rate
limiting deprioritizes users (identified by username) to a lower-priority
queue if they send a lot of requests (weighted by factors such as how long a
request holds exclusive/write or shared/read locks), but if there is only one
user, this QoS mechanism will not kick in. Therefore, I'll change my benchmark
to use different users and see whether the rate limiter can balance the reads
and writes. This requires some reads to be sent to the leader, since the rate
limiter is not distributed, so this also has some possible issues.
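For illustration, the per-user deprioritization described above can be sketched as follows (hypothetical names, weights, and threshold; not Hadoop's actual FairCallQueue/DecayRpcScheduler implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: each user accumulates a weighted cost (exclusive/write-lock requests
// cost more than shared/read-lock ones); heavy users are demoted to a
// lower-priority queue. With a single user, demotion never differentiates
// anyone, which is why the single-client benchmark defeats this QoS.
final class UserCostScheduler {
  private static final long WRITE_COST = 10; // assumed weight, illustrative
  private static final long READ_COST = 1;

  private final Map<String, Long> costByUser = new HashMap<>();
  private final long demoteThreshold;

  UserCostScheduler(long demoteThreshold) {
    this.demoteThreshold = demoteThreshold;
  }

  void record(String user, boolean isWrite) {
    costByUser.merge(user, isWrite ? WRITE_COST : READ_COST, Long::sum);
  }

  /** 0 = normal priority queue, 1 = deprioritized queue. */
  int priority(String user) {
    return costByUser.getOrDefault(user, 0L) >= demoteThreshold ? 1 : 0;
  }
}
```

(The real scheduler also decays costs over time; that is omitted here for brevity.)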
Improving read throughput without sacrificing any linearizability is tricky.
As a next step, I will attempt RATIS-2376 to support a timestamp-based
staleness bound.
Let me know what you think. Thanks again for the discussion and the food for
thought.
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch, leader-batch-write.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better the write
> throughput becomes; we saw around a 2-3x write throughput increase compared
> to leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (some cases can be below 0.2x). Even with optimizations such as
> RATIS-2392 RATIS-2382 [https://github.com/apache/ratis/pull/1334] RATIS-2379,
> the read throughput remains worse than leader-only write (these mostly
> improve the write performance rather than the read performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only
> write and read. Currently, a pure-read (no writes) workload improves read
> throughput by up to 1.7x, but total follower read throughput is way below
> this target.
> Currently my ideas are
> * Sacrificing writes for reads: can we limit the write QPS so that the read
> QPS can increase?
> ** From the benchmark, the read throughput only improves when the write
> throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance so
> quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)