[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079322#comment-18079322
]
Ivan Andika edited comment on RATIS-2403 at 5/8/26 5:53 AM:
------------------------------------------------------------
We ran a benchmark using leader batch write with 10ms and 5ms batch intervals.
We also allowed lower consistency by skipping the leadership check on the leader
(RATIS-2382).
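For context, here is a minimal sketch of the leader batch write idea, assuming a simple queue-and-flush design. The class and method names are hypothetical illustrations, not the attached leader-batch-write.patch:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the leader buffers incoming write requests and
// submits them together once per batch interval instead of one by one.
class LeaderBatchWriter {
  private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  LeaderBatchWriter(long batchIntervalMs) {
    // Flush buffered writes once per interval (10ms or 5ms in this benchmark).
    scheduler.scheduleAtFixedRate(this::flush, batchIntervalMs,
        batchIntervalMs, TimeUnit.MILLISECONDS);
  }

  void submit(Runnable writeRequest) {
    pending.add(writeRequest); // enqueue instead of appending immediately
  }

  private void flush() {
    List<Runnable> batch = new ArrayList<>();
    pending.drainTo(batch);
    // A real implementation would append the whole batch to the Raft log
    // together; running the buffered requests keeps the sketch self-contained.
    batch.forEach(Runnable::run);
  }
}
{code}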
Observation
* For read-dominated workloads, read throughput improved, but write throughput suffers
** This is slightly better than before, since prior to these improvements, read
throughput suffered even in read-dominated workloads
** However, we recently fixed a large bottleneck introduced by Hadoop metrics
that improved leader read performance 5x (we can now reach around 230K QPS, up
from the previous 40K QPS). After this improvement, the read throughput gains
from leader batch write disappear (write throughput is reduced by 50% and read
throughput by 40%), although the read throughput decrease is not as bad if we
don't use leader batch write at all.
*** Interestingly, read throughput previously improved in pure-read (no writes)
workloads, but after leader read performance was improved, read throughput in
pure-read workloads decreased. This suggests that the ReadIndex network
overhead might be the bottleneck (see the sketch after this list). We will try
to address the ReadIndex network overhead in RATIS-2524.
*** Even with the local read improvement in RATIS-2509, follower read
throughput does not improve much (although RATIS-2509 improved leader read
performance).
* For write-dominated workloads, write throughput improved, but read throughput suffers
* Reducing the batch interval from 10ms to 5ms causes read throughput to
suffer overall
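To illustrate where the read overheads above come from, here is a minimal sketch of the standard Raft ReadIndex flow for a linearizable follower read. The method names are hypothetical placeholders, not Ratis APIs:
{code:java}
import java.util.concurrent.CompletableFuture;

// Sketch of the standard Raft ReadIndex flow on a follower.
class FollowerReadSketch {
  private volatile long appliedIndex; // index this follower has applied so far

  CompletableFuture<byte[]> linearizableRead(byte[] key) {
    // 1. Network overhead: one round trip to the leader to learn its commit
    //    index (the ReadIndex); this is what RATIS-2524 aims to reduce.
    return fetchLeaderCommitIndex()
        // 2. Wait overhead: block until this follower has applied up to that
        //    index. The faster writes advance the index, the longer the wait.
        .thenCompose(this::awaitAppliedIndex)
        // 3. Serve the read from the local state machine.
        .thenApply(ignored -> readLocalStateMachine(key));
  }

  // The methods below are hypothetical placeholders, not Ratis APIs.
  CompletableFuture<Long> fetchLeaderCommitIndex() {
    return CompletableFuture.completedFuture(appliedIndex);
  }

  CompletableFuture<Long> awaitAppliedIndex(long readIndex) {
    // A real implementation would park until appliedIndex >= readIndex.
    return CompletableFuture.completedFuture(readIndex);
  }

  byte[] readLocalStateMachine(byte[] key) {
    return new byte[0]; // placeholder value
  }
}
{code}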
Notable info
* The benchmark only tests OM metadata throughput (it only handles 0-sized
keys). In workloads where keys carry actual data, the metadata latency increase
might be amortized by the data write and read latency.
* The benchmark only tests with 100 clients; with more clients, the read
throughput increase might be more noticeable.
* The benchmark uses mixed read-and-write clients; for read-only clients,
the read throughput increase might be better. Conversely, for write-only
clients, the write throughput might be worse.
* The benchmark only operates on a single bucket, so write and read throughput
might be bottlenecked by the bucket lock. Multiple buckets might help.
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: 1362_review.patch, 1362_review2.patch,
> LAW_THEOREM_RATIS_ANALYSIS.md, leader-backpressure.patch,
> leader-batch-write.patch
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better write
> throughput becomes; we saw around a 2-3x write throughput increase compared
> to leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (some cases can be below 0.2x). Even with optimizations such
> as RATIS-2392, RATIS-2382 [https://github.com/apache/ratis/pull/1334], and
> RATIS-2379, the read throughput remains worse than leader-only write and
> read (they even improve the write performance instead of the read
> performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only write and
> read. Currently, a pure-read (no writes) workload improves read throughput up
> to 1.7x, but total follower read throughput is far below this target.
> Currently my ideas are:
> * Sacrificing writes for reads: can we limit the write QPS so that the read
> QPS can increase?
> ** From the benchmark, the read throughput only improves when the write
> throughput is lower
> ** We can try to use a backpressure mechanism so that writes do not advance
> so quickly that read throughput suffers (see the sketch after this list)
> *** Follower gap mechanism (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
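> A minimal sketch of the backpressure idea, assuming a simple cap on in-flight
> writes; the class name and permit count are hypothetical:
> {code:java}
> import java.util.concurrent.Semaphore;
>
> class WriteBackpressure {
>   // Cap the number of in-flight writes so the commit index cannot advance
>   // arbitrarily far ahead of the readers. The permit count of 1000 is an
>   // arbitrary example; as noted above, the right value is workload-dependent.
>   private final Semaphore inFlightWrites = new Semaphore(1000);
>
>   void submitWrite(Runnable appendToRaftLog) throws InterruptedException {
>     inFlightWrites.acquire(); // blocks new writes once the cap is reached
>     try {
>       appendToRaftLog.run();
>     } finally {
>       // A real implementation would release the permit when the entry is
>       // applied, not merely appended, to actually bound the read-index gap.
>       inFlightWrites.release();
>     }
>   }
> }
> {code}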
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]