[ https://issues.apache.org/jira/browse/CASSANDRA-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152884#comment-16152884 ]

Aleksey Yeschenko commented on CASSANDRA-13794:
-----------------------------------------------

Work in progress branch 
[here|https://github.com/iamaleksey/cassandra/tree/13794-3.0]. It's currently 
missing (new) tests, but I want to get the underlying logic reviewed and 
approved first; I'll add coverage before committing.

A short summary of the issue: the code right now has two variables swapped, 
which ultimately results in us always fetching 1 extra row per short read 
protection request, in a blocking manner, making it very inefficient. But upon 
closer look, there are some other inefficiencies here that can and should be 
addressed:
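
To make the effect concrete, here is a tiny standalone sketch - not the actual 
{{DataResolver}} code, with the formula reconstructed from the comment of 
intent quoted at the bottom of this ticket - showing how swapping the two 
counts, combined with the outer unconditional {{Math.max(..., 1)}}, collapses 
every retry into a single extra row:

{code:java}
// Illustrative only: a reconstruction of the retry-size calculation, not the real DataResolver code.
// n ~ counter.countedInCurrentPartition()                   (rows counted before purging)
// x ~ postReconciliationCounter.countedInCurrentPartition() (rows that survived reconciliation)
public final class ShortReadRetrySizeSketch
{
    // Intent: request m rows so that m * x / n == n - x, i.e. m = (n - x) * n / x.
    static int intendedRetrySize(int n, int x)
    {
        return Math.max((n - x) * n / Math.max(x, 1), 1);
    }

    // With x and n swapped, (x - n) <= 0 whenever x <= n, so the unconditional
    // Math.max(..., 1) silently turns every request into a single blocking row fetch.
    static int swappedRetrySize(int n, int x)
    {
        return Math.max((x - n) * x / Math.max(n, 1), 1);
    }

    public static void main(String[] args)
    {
        int n = 100, x = 10; // 100 rows merged, only 10 survived purging: we are ~90 rows short
        System.out.println(intendedRetrySize(n, x)); // 900
        System.out.println(swappedRetrySize(n, x));  // 1
    }
}
{code}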

1. One of our stop conditions is {{lastCount == counter.counted()}}. It's 
supposed to abort short read protection if our previous attempt to fetch more 
rows yielded 0 extra rows. It's not incorrect, but it's only a special case of 
a more general scenario: our previous attempt to fetch extra rows yielded fewer 
results than we requested. That would mean there are no more rows to fetch at 
that replica, which allows us to abort earlier and more frequently.
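
A hypothetical sketch of the generalised check (names are mine, not taken from 
the branch): instead of aborting only when the previous fetch added nothing, 
abort whenever the replica returned fewer rows than we asked it for:

{code:java}
// Hypothetical helper, not lifted from the patch: contrasts the two stop conditions.
final class ShortReadStopConditionSketch
{
    // Current condition: stop only if the previous retry produced zero new rows.
    static boolean stopCurrent(int lastCount, int counted)
    {
        return lastCount == counted;
    }

    // Generalised condition: if the replica returned fewer rows than we requested,
    // it has nothing left for us, so stop now instead of issuing another request
    // that is guaranteed to come back empty.
    static boolean stopGeneralised(int rowsRequested, int rowsReturned)
    {
        return rowsReturned < rowsRequested;
    }
}
{code}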

2. Another of our stop conditions is {{!counter.isDoneForPartition()}}. Once 
again, it isn't incorrect, but it can be extended further. Due to the way 
{{isDoneForPartition()}} is defined ({{isDone() || rowInCurrentPartition >= 
perPartitionLimit}}), and because that counter is counting-only, it's possible 
for the rows previously fetched for other partitions' short read retries to 
have already hit the counter's global row limit. That would make {{isDone()}} 
always return {{true}}, and make {{isDoneForPartition()}} return false 
positives even if the partition currently being processed only has a 
partition-level deletion and/or tombstones. That can affect queries that set a 
per-partition limit explicitly, as well as {{SELECT DISTINCT}} queries. I 
spotted this while fixing CASSANDRA-13747.
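
A toy model (hypothetical names, but following the definition quoted above) of 
how a counting-only counter that has already hit the global limit through other 
partitions' retries reports the current partition as done, even though that 
partition has contributed no live rows at all:

{code:java}
// Toy model of a counting-only DataLimits counter, mirroring the definition quoted above.
final class CountingOnlyCounterSketch
{
    final int limit;              // global row limit for the query
    final int perPartitionLimit;  // per-partition row limit
    int counted;                  // rows counted across all partitions so far
    int rowsInCurrentPartition;   // rows counted in the partition currently being processed

    CountingOnlyCounterSketch(int limit, int perPartitionLimit)
    {
        this.limit = limit;
        this.perPartitionLimit = perPartitionLimit;
    }

    boolean isDone()             { return counted >= limit; }
    boolean isDoneForPartition() { return isDone() || rowsInCurrentPartition >= perPartitionLimit; }

    public static void main(String[] args)
    {
        CountingOnlyCounterSketch counter = new CountingOnlyCounterSketch(100, 10);

        // Rows fetched for earlier partitions' short read retries have already hit the global limit...
        counter.counted = 100;

        // ...so a partition that so far has produced only a partition-level deletion and/or
        // tombstones is still reported as done for that partition.
        counter.rowsInCurrentPartition = 0;
        System.out.println(counter.isDoneForPartition()); // true - a false positive
    }
}
{code}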

3. Once we've swapped {{x}} and {{n}} in {{moreContents()}} to fix the logic 
error, we'd still have some issues. In degenerate cases where some nodes are 
missing a fresh partition deletion, for example, the formula would fetch *a 
lot* of rows ({{n * (n - 1)}} of them), with {{n}} growing exponentially with 
every attempt.
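
For illustration (my numbers, using the intent formula {{m = (n - x) * n / x}} 
reconstructed above): if stale replicas keep returning rows that a fresh 
partition deletion then purges, {{x}} stays stuck near 1 while {{n}} balloons 
on every attempt:

{code:java}
// Illustrative arithmetic only: behaviour of the reconstructed formula m = (n - x) * n / x
// when nearly every merged row is purged by a deletion that some nodes are missing (x stuck at 1).
public final class DegenerateRetryGrowthSketch
{
    public static void main(String[] args)
    {
        long n = 100, x = 1;
        for (int attempt = 1; attempt <= 3; attempt++)
        {
            long m = (n - x) * n / x; // equals n * (n - 1) when x == 1
            System.out.println("attempt " + attempt + ": request " + m + " more rows");
            n += m;                   // the newly fetched rows get merged (and then mostly purged)
        }
        // attempt 1: 9900 rows, attempt 2: ~1e8 rows, attempt 3: ~1e16 rows
    }
}
{code}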

Upon closer inspection, the formula doesn't make 100% sense. It claims that we 
are missing {{n - x}} rows - where {{n = counter.countedInCurrentPartition()}} 
and {{x = postReconciliationCounter.countedInCurrentPartition()}} - but the 
number of rows we really miss is {{limit - postReconciliationCounter.counted()}} 
or {{perPartitionLimit - postReconciliationCounter.countedInCurrentPartition()}}. 
They might be the same on our first short read protection iteration, but they 
will diverge further and further with each request. In addition, the formula 
seems to assume a uniform distribution, within the source partition, of the 
rows that end up as tombstones in the final result, which can't be true for 
most workloads.
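
A hypothetical side-by-side of the two estimates (names follow the variables 
above; the limit-based variants are what this paragraph argues we should be 
working from instead):

{code:java}
// Hypothetical comparison of the two ways of estimating "rows we are still missing".
final class MissingRowsEstimateSketch
{
    // What the formula effectively uses: n - x, merged rows minus surviving rows in the
    // current partition. It may coincide with what the client is still owed on the first
    // short read protection iteration, but diverges with every subsequent request.
    static int formulaEstimate(int countedInCurrentPartition, int survivingInCurrentPartition)
    {
        return countedInCurrentPartition - survivingInCurrentPartition;
    }

    // What we actually still owe the client, query-wide...
    static int missingForQuery(int limit, int postReconciliationCounted)
    {
        return limit - postReconciliationCounted;
    }

    // ...or within the current partition.
    static int missingForPartition(int perPartitionLimit, int postReconciliationCountedInPartition)
    {
        return perPartitionLimit - postReconciliationCountedInPartition;
    }
}
{code}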

I couldn't come up with an ideal heuristic that covers all workloads, so I 
stuck to something safe that respects client paging limits, but still attempts 
to minimise the number of requests we make by fetching (in most cases) more 
rows than is minimally necessary. I'm not completely sure about it, and I 
welcome any ideas on how to make it better. Either way, anything we do should 
be significantly more efficient than what we have now.
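
Purely as an illustration of the trade-off (this is not the heuristic in the 
linked branch, just a made-up example of the shape of thing I mean): pick a 
retry size that over-fetches to save round trips, but never asks for more than 
the client-visible limits still allow:

{code:java}
// Made-up example only - NOT the heuristic from the 13794-3.0 branch - of a fetch size
// that over-requests to cut the number of round trips while still respecting paging limits.
final class RetrySizeHeuristicSketch
{
    static int retrySize(int remainingForQuery, int remainingInPartition)
    {
        // the minimum we'd need if every fetched row survived reconciliation
        int strictlyNeeded = Math.max(Math.min(remainingForQuery, remainingInPartition), 1);
        // over-fetch by a factor of two as slack for purged rows, but never beyond
        // what the client-facing query limit still allows
        return Math.min(strictlyNeeded * 2, Math.max(remainingForQuery, 1));
    }
}
{code}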

I've also made some renames and refactorings, and moved a few things around, 
both to better understand the code myself and to make it clearer for future 
contributors - including future me. The most significant noticeable change is 
moving the application of the per-response counter into the 
{{mergeWithShortReadProtection()}} method, instead of overloading 
{{ShortReadRowProtection}} with responsibilities - I also like having it next 
to the global counter creation, so you can see the contrast in arguments.

> Fix short read protection logic for querying more rows
> ------------------------------------------------------
>
>                 Key: CASSANDRA-13794
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13794
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Benedict
>            Assignee: Aleksey Yeschenko
>             Fix For: 3.0.x, 3.11.x
>
>
> Discovered by [~benedict] while reviewing CASSANDRA-13747:
> {quote}
> While reviewing I got a little suspicious of the modified line 
> {{DataResolver}} :479, as it seemed that n and x were the wrong way around... 
> and, reading the comment of intent directly above, and reproducing the 
> calculation, they are indeed.
> This is probably a significant enough bug that it warrants its own ticket for 
> record keeping, though I'm fairly agnostic on that decision.
> I'm a little concerned about our current short read behaviour, as right now 
> it seems we should be requesting exactly one row, for any size of under-read, 
> which could mean extremely poor performance in case of large under-reads.
> I would suggest that the outer unconditional {{Math.max}} is a bad idea, has 
> been (poorly) insulating us from this error, and that we should first be 
> asserting that the calculation yields a value >= 0 before setting to 1.
> {quote}


