[ 
https://issues.apache.org/jira/browse/RATIS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2524:
-------------------------------
    Description: 
We recently found that, after an optimization that improved OM pure-read 
performance 5x (to 240K), the pure follower-read performance is worse than the 
leader-read performance (previously, follower reads improved pure-read 
throughput by 50%). The most likely suspect is network and proto serde 
overhead: each read triggers a ReadIndex call, and if the per-call network and 
serde overhead is high, it becomes the bottleneck.

One improvement is to batch multiple reads into a single ReadIndex call.

Rule: A ReadIndex result may only serve reads whose invocation happened before 
the ReadIndex request is logically issued.


{code:java}
t1: read A arrives at follower
t2: read B arrives at follower
t3: follower sends one ReadIndex request for batch [A, B]
t4: leader processes ReadIndex and returns index I
t5: follower applies >= I
t6: A and B query local state and complete 
{code}

It must not be:

{code:java}
t1: read A arrives
t2: follower sends ReadIndex request
t3: leader processes it
t4: read B arrives
t5: follower attaches B to A's ReadIndex result 
{code}

This can be implemented using a batching window with a small batching interval 
(e.g. 500 microseconds or less, depending on the average latency). Reads 
arriving during the batching interval are accumulated into one ReadIndex 
batch. Once the interval elapses, we seal the batch (i.e. no more reads will 
be added to it) and send a single ReadIndex that covers all the reads in the 
sealed window (e.g. if the window holds 5 read requests, 1 ReadIndex call 
amortizes the cost of 5 ReadIndex calls). New reads go into the next ReadIndex 
batch.
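As a rough illustration, here is a minimal, self-contained Java sketch of such a batching window. The class and method names are hypothetical (not actual Ratis APIs), and the leader round trip and applied-index wait are stubbed out:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: collect reads during a batching window and issue
// ONE ReadIndex per sealed batch.
public class ReadIndexCoalescer {
  // Reads collected during the current (unsealed) batching window.
  private List<CompletableFuture<Long>> pending = new ArrayList<>();
  private long appliedIndex = 0;

  // A read arrives at the follower: park it in the current batch.
  public synchronized CompletableFuture<Long> submitRead() {
    CompletableFuture<Long> future = new CompletableFuture<>();
    pending.add(future);
    return future;
  }

  // The batching interval elapsed: seal the batch, send one ReadIndex
  // covering all reads in it, wait until appliedIndex >= readIndex, then
  // let every read in the batch query local state.
  public void sealAndSend() {
    List<CompletableFuture<Long>> sealed;
    synchronized (this) {
      sealed = pending;             // seal: no more reads join this batch
      pending = new ArrayList<>();  // new reads go to the next batch
    }
    if (sealed.isEmpty()) {
      return;
    }
    long readIndex = sendReadIndexToLeader(); // 1 RPC amortized over the batch
    waitUntilApplied(readIndex);              // follower applies >= readIndex
    for (CompletableFuture<Long> read : sealed) {
      read.complete(readIndex);     // safe to serve the read locally now
    }
  }

  // Stub: a real implementation sends the Raft ReadIndex RPC to the leader.
  private long sendReadIndexToLeader() {
    return 42L;
  }

  // Stub: a real implementation blocks until the state machine applies index.
  private void waitUntilApplied(long index) {
    appliedIndex = Math.max(appliedIndex, index);
  }
}
{code}

Because the batch is sealed before the ReadIndex is issued, every read in it was invoked before the ReadIndex, which is exactly the rule above.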

Correctness is guaranteed because, once the batch is sealed and the ReadIndex 
completes, all writes that precede the sync point in the global order are 
locally applied, so the reads can be linearized at or after the sync point.

This idea is similar to the paper 
https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf (https://law-theorem.com/) 
mentioned in RATIS-2403, with the lightweight "sync" write operation replaced 
by ReadIndex (which is also a form of "sync"). The two approaches compare as 
follows:

Our coalesced ReadIndex design:

{code:java}
seal pending reads
send one ReadIndex
wait follower appliedIndex >= readIndex
execute all reads locally
{code}

Lazy-ALR:


{code:java}
seal pending reads
send one fake write / sync through write protocol
wait until sync would apply locally
execute all reads locally
{code}

In summary, RATIS-2403 batches writes together to reduce the bottleneck 
introduced by the rapid increase in ReadIndex calls (which causes longer 
follower waitForAdvance), while this patch focuses on amortizing the 
network-latency bottleneck for reads. Therefore, I think we can try to combine 
both improvements.



> Implement ReadIndex coalescing
> ------------------------------
>
>                 Key: RATIS-2524
>                 URL: https://issues.apache.org/jira/browse/RATIS-2524
>             Project: Ratis
>          Issue Type: Improvement
>          Components: Linearizable Read, server
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
