[ https://issues.apache.org/jira/browse/CASSANDRA-15774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ZhaoYang updated CASSANDRA-15774: --------------------------------- Fix Version/s: 4.x > Improve range reads to query by endpoints instead of vnodes to reduce number > of remote requests > ----------------------------------------------------------------------------------------------- > > Key: CASSANDRA-15774 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15774 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Coordination > Reporter: ZhaoYang > Priority: Normal > Fix For: 4.x > > > Currently, range read queries in batches, see > {{StorageProxy.RangeCommandIterator#sendNextRequests()}}. For each batch, it > computes a list of merged vnode ranges up to concurrency factor and query > each merged vnode range asynchronously. (note: consecutive vnode ranges can > be merged if they share enough replicas to satisfy consistency level > requirement) > This works fine in general, but when concurrency factor is high because > returned row count is small comparing to query limit or index filtering is > used, coordinator may send too many concurrent remote range requests in a > batch. > We can improve it by grouping remote range requests by endpoints where each > endpoint will return response corresponding to multiple non-consecutive > ranges. With endpoint grouping, number of remote range requests should > largely reduced and it's always capped by number of nodes in the cluster > instead of number of ranges which is capped by concurrency factor. > Let's look at an example on a 5-node cluster with 10 > ranges(a,b,c,d,e,f,g,h,i,h) and rf3. > Following is the range to replica mapping using round robin that should work > well with consecutive range merger (consecutive range merger doesn't work > well with fully random replica mapping, because it's less likely to have > overlapping replicas for consecutive ranges) > {code:java} > range-a replicas: 1, 2, 3 > range-b replicas: 2, 3, 4 > range-c replicas: 3, 4, 5 > range-d replicas: 1, 4, 5 > range-e replicas: 1, 2, 5 > range-f replicas: 1, 2, 3 > range-g replicas: 2, 3, 4 > range-h replicas: 3, 4, 5 > range-i replicas: 1, 4, 5 > range-j replicas: 1, 2, 5 > {code} > With default range read implementation and consecutive range merger, we need > 10 replica read requests(2 for each merged range) for quorum: > {code:java} > range (a,b] on node [2, 3] > range (c,d] on node [4, 5] > range (e,f] on node [1, 2] > range (g,h] on node [3, 4] > range (i,j] on node [1, 5] > {code} > With group query by endpoints, we only need 4 replica read requests for > quorum: > {code:java} > * node 1: a, d, e, f, i, j > * node 2: a, b, e, f, g, j > * node 3: b, c, g, h > * node 4: c, d, h, i > {code} > > Note that there are some complexities around short-read protection which > needs to know whether replica has more rows available for current range. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org