[ 
https://issues.apache.org/jira/browse/RATIS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2524:
-------------------------------
    Summary: Implement ReadIndex batching  (was: Implement ReadIndex coalescing)

> Implement ReadIndex batching
> ----------------------------
>
>                 Key: RATIS-2524
>                 URL: https://issues.apache.org/jira/browse/RATIS-2524
>             Project: Ratis
>          Issue Type: Improvement
>          Components: Linearizable Read, server
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We recently found that, after an optimization that pushed the OM pure read 
> performance 5x (to 240K), the pure follower read performance is worse than 
> the leader read performance (previously follower reads could improve pure 
> read throughput by 50%; now they decrease it by 40%). I suspect this is due 
> to the network and ReadIndexProto serde overhead, since the read QPS is now 
> much higher. Each read triggers a ReadIndex call, so if the network and serde 
> overhead is high, this can become the bottleneck.
> One improvement is to batch reads together into a single ReadIndex call.
> Rule: a ReadIndex result may only serve reads whose invocation happened 
> before the ReadIndex request was logically issued. A valid ordering:
> {code:java}
> t1: read A arrives at follower
> t2: read B arrives at follower
> t3: follower sends one ReadIndex request for batch [A, B]
> t4: leader processes ReadIndex and returns index I
> t5: follower waits until its appliedIndex >= I
> t6: A and B query local state and complete 
> {code}
> It is not the following:
> {code:java}
> t1: read A arrives
> t2: follower sends ReadIndex request
> t3: leader processes it
> t4: read B arrives
> t5: follower attaches B to A's ReadIndex result 
> {code}
> This can be implemented using a batching window with a small batching 
> interval (e.g. 500 microseconds or less, depending on the average latency) or 
> a batch size limit (e.g. 64 reads). We batch the reads arriving during the 
> batching interval into one ReadIndex batch. After the batching interval ends, 
> we seal this ReadIndex batch (i.e. no more reads will be added to it) and 
> then send a single ReadIndex that covers all the reads in the sealed window 
> (e.g. if the window contains 5 read requests, then 1 ReadIndex call amortizes 
> the cost of 5 ReadIndex calls). New reads go into the next ReadIndex batch.
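> A minimal sketch of this batching window, assuming illustrative names only 
> (ReadIndexBatcher, submitRead, and the placeholder index are not Ratis APIs; 
> a real implementation would seal on a timer and wait for appliedIndex >= 
> readIndex before completing the reads):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a ReadIndex batcher. Reads arriving within a
// batching window join one batch; sealing the batch issues a single
// ReadIndex round trip that serves every read in the window.
class ReadIndexBatcher {
  // Illustrative size limit; a real one would be configurable.
  static final int MAX_BATCH_SIZE = 64;

  private List<CompletableFuture<Long>> pending = new ArrayList<>();
  private int readIndexCalls = 0;

  // A read joins the currently open batch and gets a future for the index.
  synchronized CompletableFuture<Long> submitRead() {
    CompletableFuture<Long> f = new CompletableFuture<>();
    pending.add(f);
    if (pending.size() >= MAX_BATCH_SIZE) {
      seal(); // size limit reached: seal early
    }
    return f;
  }

  // Called when the batching interval elapses (or the size limit is hit):
  // seal the batch so later reads go to the next batch, then issue ONE
  // ReadIndex call covering every read sealed in this window.
  synchronized void seal() {
    if (pending.isEmpty()) {
      return;
    }
    List<CompletableFuture<Long>> batch = pending;
    pending = new ArrayList<>(); // new reads go to the next batch
    long index = sendReadIndex(); // one RPC amortized over batch.size() reads
    // The real flow would wait until appliedIndex >= index before completing;
    // here we complete immediately for illustration.
    batch.forEach(f -> f.complete(index));
  }

  // Stand-in for the leader round trip; returns a placeholder commit index.
  private long sendReadIndex() {
    readIndexCalls++;
    return 42L;
  }

  int readIndexCallCount() {
    return readIndexCalls;
  }

  public static void main(String[] args) {
    ReadIndexBatcher batcher = new ReadIndexBatcher();
    for (int i = 0; i < 5; i++) {
      batcher.submitRead();
    }
    batcher.seal(); // e.g. triggered by the batching-interval timer
    // 5 reads were served by a single ReadIndex call
    System.out.println("ReadIndex calls for 5 reads: " + batcher.readIndexCallCount());
  }
}
```

> Here 5 reads cost 1 ReadIndex round trip, which is the amortization the 
> paragraph above describes.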
> Correctness is guaranteed because, once the batch is sealed and the 
> ReadIndex completes, all writes that precede the sync point in the global 
> order are locally applied, so the reads can be linearized at or after the 
> sync point.
> This idea is similar to the paper 
> https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf 
> (https://law-theorem.com/) mentioned in RATIS-2403, where the lightweight 
> "sync" write operation is replaced with ReadIndex (which is also a form of 
> "sync"). We can compare the two approaches below:
> Our batched ReadIndex design:
> {code:java}
> seal pending reads
> send one ReadIndex
> wait follower appliedIndex >= readIndex
> execute all reads locally
> {code}
> Lazy-ALR:
> {code:java}
> seal pending reads
> send one fake write / sync through write protocol
> wait until sync would apply locally
> execute all reads locally
> {code}
> In summary, RATIS-2403 batches writes together into a single RepliedIndex to 
> reduce the bottleneck introduced by the increased ReadIndex rate (which 
> causes a longer follower waitForAdvance), while this patch focuses on 
> amortizing the network latency bottleneck for reads. Therefore, I think we 
> can combine both improvements.
> One possible weakness is that reads become more bursty, since the reads in a 
> batch are executed at around the same time (instead of being served 
> immediately as they arrive, as before).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)