[ 
https://issues.apache.org/jira/browse/CASSANDRA-15774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15774:
---------------------------------
    Fix Version/s: 4.x

> Improve range reads to query by endpoints instead of vnodes to reduce number 
> of remote requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15774
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Coordination
>            Reporter: ZhaoYang
>            Priority: Normal
>             Fix For: 4.x
>
>
> Currently, range reads are queried in batches; see 
> {{StorageProxy.RangeCommandIterator#sendNextRequests()}}. For each batch, the 
> coordinator computes a list of merged vnode ranges, up to the concurrency 
> factor, and queries each merged vnode range asynchronously. (Note: 
> consecutive vnode ranges can be merged if they share enough replicas to 
> satisfy the consistency level requirement.)
> This works fine in general, but when the concurrency factor is high, either 
> because the returned row count is small compared to the query limit or 
> because index filtering is used, the coordinator may send too many 
> concurrent remote range requests in a batch.
> We can improve this by grouping remote range requests by endpoint, so that 
> each endpoint returns a response covering multiple non-consecutive ranges. 
> With endpoint grouping, the number of remote range requests should be 
> largely reduced, and it is always capped by the number of nodes in the 
> cluster instead of by the number of ranges, which is capped by the 
> concurrency factor.
> Let's look at an example on a 5-node cluster with 10 ranges 
> (a,b,c,d,e,f,g,h,i,j) and RF=3.
> The following is the range-to-replica mapping using round robin, which 
> should work well with the consecutive range merger (the consecutive range 
> merger doesn't work well with a fully random replica mapping, because 
> consecutive ranges are then less likely to have overlapping replicas):
> {code:java}
>    range-a replicas: 1, 2, 3
>    range-b replicas: 2, 3, 4
>    range-c replicas: 3, 4, 5
>    range-d replicas: 1, 4, 5
>    range-e replicas: 1, 2, 5
>    range-f replicas: 1, 2, 3
>    range-g replicas: 2, 3, 4
>    range-h replicas: 3, 4, 5
>    range-i replicas: 1, 4, 5
>    range-j replicas: 1, 2, 5
> {code}
> With the default range read implementation and the consecutive range merger, 
> we need 10 replica read requests (2 for each merged range) for quorum:
> {code:java}
>      range (a,b] on node [2, 3]
>      range (c,d] on node [4, 5]
>      range (e,f] on node [1, 2]
>      range (g,h] on node [3, 4]
>      range (i,j] on node [1, 5]
> {code}
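
The merge above can be reproduced with a small greedy sketch. This is only an illustration under stated assumptions, not Cassandra's actual merger: the class `ConsecutiveRangeMergerSketch`, its `merge` method, and the integer node ids are hypothetical, and "enough replicas" is modeled as a plain count (`blockFor`, 2 for quorum at RF=3).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the code in StorageProxy.
class ConsecutiveRangeMergerSketch {

    /** A run of consecutive ranges plus the replicas shared by all of them. */
    static final class Merged {
        final List<String> ranges = new ArrayList<>();
        final Set<Integer> replicas = new LinkedHashSet<>();
    }

    /**
     * Walks the ranges in ring order and extends the current run while the
     * replicas shared with the next range still number at least blockFor.
     */
    static List<Merged> merge(Map<String, List<Integer>> replicasByRange, int blockFor) {
        List<Merged> result = new ArrayList<>();
        Merged current = null;
        for (Map.Entry<String, List<Integer>> e : replicasByRange.entrySet()) {
            if (current != null) {
                Set<Integer> shared = new LinkedHashSet<>(current.replicas);
                shared.retainAll(e.getValue());
                if (shared.size() >= blockFor) {
                    current.ranges.add(e.getKey());
                    current.replicas.retainAll(shared);
                    continue;
                }
            }
            // Not enough overlap (or first range): start a new run.
            current = new Merged();
            current.ranges.add(e.getKey());
            current.replicas.addAll(e.getValue());
            result.add(current);
        }
        return result;
    }

    /** The 5-node, 10-range, RF=3 mapping from the description. */
    static Map<String, List<Integer>> exampleRing() {
        int[][] replicas = {
            {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 4, 5}, {1, 2, 5},
            {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 4, 5}, {1, 2, 5}};
        Map<String, List<Integer>> ring = new LinkedHashMap<>();
        for (int i = 0; i < replicas.length; i++) {
            List<Integer> nodes = new ArrayList<>();
            for (int node : replicas[i]) nodes.add(node);
            ring.put(String.valueOf((char) ('a' + i)), nodes);
        }
        return ring;
    }

    public static void main(String[] args) {
        for (Merged m : merge(exampleRing(), 2)) {
            System.out.println(m.ranges + " on nodes " + m.replicas);
        }
    }
}
```

On the example ring with `blockFor = 2`, this yields five merged ranges, each left with exactly two shared replicas, hence 5 x 2 = 10 replica requests.
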
> With queries grouped by endpoint, we only need 4 replica read requests for 
> quorum:
> {code:java}
>     * node 1: a, d, e, f, i, j
>     * node 2: a, b, e, f, g, j
>     * node 3: b, c, g, h
>     * node 4: c, d, h, i
> {code}
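
One way to arrive at the grouping above is a greedy pass that, for each range, prefers replicas that already have a request in the batch. The sketch below is an assumption-laden illustration, not the proposed implementation: `EndpointGroupingSketch` and `groupByEndpoint` are hypothetical names, nodes are plain integers, and the heuristic is not guaranteed to find the minimum number of endpoints in general.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the ticket does not prescribe this heuristic.
class EndpointGroupingSketch {

    /**
     * For each range, picks blockFor replicas, preferring endpoints that are
     * already being contacted, and returns the ranges each endpoint must serve.
     */
    static Map<Integer, List<String>> groupByEndpoint(
            Map<String, List<Integer>> replicasByRange, int blockFor) {
        Map<Integer, List<String>> rangesByEndpoint = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : replicasByRange.entrySet()) {
            List<Integer> candidates = new ArrayList<>(e.getValue());
            if (candidates.size() < blockFor) {
                throw new IllegalArgumentException("not enough replicas for range " + e.getKey());
            }
            // Stable sort: endpoints already in the batch come first,
            // the original replica order breaks ties.
            candidates.sort(Comparator.comparing(
                (Integer endpoint) -> !rangesByEndpoint.containsKey(endpoint)));
            for (Integer endpoint : candidates.subList(0, blockFor)) {
                rangesByEndpoint.computeIfAbsent(endpoint, k -> new ArrayList<>())
                                .add(e.getKey());
            }
        }
        return rangesByEndpoint;
    }

    /** The 5-node, 10-range, RF=3 mapping from the description. */
    static Map<String, List<Integer>> exampleRing() {
        int[][] replicas = {
            {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 4, 5}, {1, 2, 5},
            {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 4, 5}, {1, 2, 5}};
        Map<String, List<Integer>> ring = new LinkedHashMap<>();
        for (int i = 0; i < replicas.length; i++) {
            List<Integer> nodes = new ArrayList<>();
            for (int node : replicas[i]) nodes.add(node);
            ring.put(String.valueOf((char) ('a' + i)), nodes);
        }
        return ring;
    }

    public static void main(String[] args) {
        groupByEndpoint(exampleRing(), 2).forEach(
            (endpoint, ranges) -> System.out.println("node " + endpoint + ": " + ranges));
    }
}
```

With `blockFor = 2` the example ring ends up with one request each to nodes 1 through 4 and none to node 5, matching the listing above. Each request now carries several non-consecutive ranges, which is why the replica needs to report per range whether more rows are available.
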
>  
> Note that there are some complexities around short-read protection, which 
> needs to know whether a replica has more rows available for the current 
> range.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
