[
https://issues.apache.org/jira/browse/RATIS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2524:
-------------------------------
Description:
We recently found that, after an optimization that improved OM pure-read
performance 5x (to 240K), pure follower-read performance is now worse than
leader-read performance: previously, follower reads improved pure-read
throughput by 50%, but now they decrease it by 40%. I suspect this is due to
the network and ReadIndexProto serde overhead. Each read triggers a ReadIndex
call, and if the network and serde overhead is high, this becomes the
bottleneck.
One improvement is to batch multiple reads into a single ReadIndex call.
Rule: A ReadIndex result may only serve reads whose invocation happened before
the ReadIndex request is logically issued.
{code:java}
t1: read A arrives at follower
t2: read B arrives at follower
t3: follower sends one ReadIndex request for batch [A, B]
t4: leader processes ReadIndex and returns index I
t5: follower applies >= I
t6: A and B query local state and complete
{code}
It is not the following (which would be incorrect):
{code:java}
t1: read A arrives
t2: follower sends ReadIndex request
t3: leader processes it
t4: read B arrives
t5: follower attaches B to A's ReadIndex result
{code}
This can be implemented with a batching window using a small batching interval
(e.g. 500 microseconds or less, depending on the average latency) and/or a
batch-size limit (e.g. 64 reads). Reads arriving during the batching interval
are collected into one ReadIndex batch. When the interval ends, the batch is
sealed (i.e. no more reads will be added to it), and a single ReadIndex is sent
that covers all the reads in the sealed window (e.g. if the window holds 5 read
requests, 1 ReadIndex call amortizes the cost of 5 ReadIndex calls). New reads
go to the next ReadIndex batch.
Correctness is preserved: once the batch is sealed and its ReadIndex completes,
all writes that precede the sync in the global order have been locally applied,
so every read in the batch can be linearized at or after the sync point.
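As a concrete sketch, the batching window could look roughly like the
following. All class and method names here are hypothetical; the real change
would hook into the Ratis follower read path, and the leader RPC and the
local apply-wait are stubbed out with immediately-completed futures:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Minimal sketch of a ReadIndex coalescer (hypothetical names). Reads
// register into the current window; sealing sends ONE ReadIndex for the
// whole batch, waits for the local appliedIndex to catch up, and then
// completes every read in the batch.
public class ReadIndexCoalescer {
  static final int MAX_BATCH = 64;       // seal early when the batch is full

  private List<CompletableFuture<Long>> pending = new ArrayList<>();

  // Each incoming read registers here instead of issuing its own ReadIndex.
  synchronized CompletableFuture<Long> register() {
    CompletableFuture<Long> f = new CompletableFuture<>();
    pending.add(f);
    if (pending.size() >= MAX_BATCH) {
      seal();
    }
    return f;
  }

  // Invoked by a timer every batching interval (e.g. 500us), or when full.
  synchronized void seal() {
    if (pending.isEmpty()) {
      return;
    }
    final List<CompletableFuture<Long>> batch = pending;
    pending = new ArrayList<>();         // new reads join the next batch
    sendReadIndex()
        .thenCompose(readIndex ->
            waitForAppliedIndexAtLeast(readIndex).thenApply(v -> readIndex))
        .thenAccept(readIndex ->
            // one ReadIndex + one wait is amortized over the whole batch
            batch.forEach(f -> f.complete(readIndex)));
  }

  // Stub standing in for the ReadIndex RPC to the leader.
  CompletableFuture<Long> sendReadIndex() {
    return CompletableFuture.completedFuture(42L);
  }

  // Stub standing in for waiting until the local appliedIndex >= index.
  CompletableFuture<Void> waitForAppliedIndexAtLeast(long index) {
    return CompletableFuture.completedFuture(null);
  }

  public static void main(String[] args) {
    ReadIndexCoalescer c = new ReadIndexCoalescer();
    CompletableFuture<Long> a = c.register();  // read A arrives at follower
    CompletableFuture<Long> b = c.register();  // read B arrives at follower
    c.seal();                                  // window closes: one ReadIndex
    System.out.println("A served at index " + a.join());
    System.out.println("B served at index " + b.join());
  }
}
{code}
Here register() replaces the per-read ReadIndex call, and seal() is driven
either by the batching-interval timer or by the batch-size limit.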
This idea is similar to the paper
https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf (https://law-theorem.com/)
mentioned in RATIS-2403, with the lightweight "sync" write operation replaced
by ReadIndex (which is also a form of "sync"). The two approaches compare as
follows:
Our coalesced ReadIndex design:
{code:java}
seal pending reads
send one ReadIndex
wait follower appliedIndex >= readIndex
execute all reads locally
{code}
Lazy-ALR:
{code:java}
seal pending reads
send one fake write / sync through write protocol
wait until sync would apply locally
execute all reads locally
{code}
While RATIS-2403 batches writes together into a single RepliedIndex to reduce
the bottleneck introduced by the high ReadIndex rate (which causes longer
follower waitForAdvance), this patch focuses on amortizing the network-latency
bottleneck for reads. Therefore, I think we can try to combine both
improvements.
One possible weakness is that reads become more bursty, since the reads in a
batch are executed at around the same time, instead of each read being served
immediately as it arrives.
> Implement ReadIndex coalescing
> ------------------------------
>
> Key: RATIS-2524
> URL: https://issues.apache.org/jira/browse/RATIS-2524
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read, server
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)