[ https://issues.apache.org/jira/browse/CASSANDRA-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152884#comment-16152884 ]
Aleksey Yeschenko commented on CASSANDRA-13794:
-----------------------------------------------

Work in progress branch [here|https://github.com/iamaleksey/cassandra/tree/13794-3.0]. Currently missing (new) tests, but I want to get the underlying logic reviewed and approved first; I'll add coverage before committing.

A short summary of the issue: the code right now has two variables swapped, which ultimately results in us always fetching 1 extra row per short read protection request, in a blocking manner, making it very inefficient. But upon closer look, there are some other inefficiencies here that can and should be addressed:

1. One of our stop conditions is {{lastCount == counter.counted()}}. It's supposed to abort short read retries if our previous attempt to fetch more rows yielded 0 extra rows. It's not incorrect, but it is only a special case of a more general scenario: our previous attempt yielded fewer rows than we requested. That means there are no more rows to fetch at that replica, and it allows us to abort earlier and more frequently (see the first sketch below).

2. Another of our stop conditions is {{!counter.isDoneForPartition()}}. Once again, it isn't incorrect, but it can be extended further. Due to the way {{isDoneForPartition()}} is defined ({{isDone() || rowInCurrentPartition >= perPartitionLimit}}), and because that counter is counting-only, short read retries for other partitions can previously have fetched enough rows in total to hit the counter's global limit. That makes {{isDone()}} always return {{true}}, and makes {{isDoneForPartition()}} return false positives even if the partition currently being processed only has a partition-level deletion and/or tombstones (see the second sketch below). This can affect queries that set a per-partition limit explicitly, as well as {{SELECT DISTINCT}} queries. Spotted while fixing CASSANDRA-13747.

3. Once we've swapped {{x}} and {{n}} in {{moreContents()}} to fix the logic error, we'd still have some issues. In degenerate cases - where some nodes are missing a fresh partition deletion, for example - the formula would fetch *a lot* of rows ({{n * (n - 1)}}), with {{n}} growing exponentially with every attempt. Upon closer inspection, the formula doesn't make 100% sense. It claims that we miss {{n - x}} rows - where {{n = counter.countedInCurrentPartition()}} and {{x = postReconciliationCounter.countedInCurrentPartition()}} - but the number of rows we really miss is {{limit - postReconciliationCounter.counted()}}, or {{perPartitionLimit - postReconciliationCounter.countedInCurrentPartition()}}. The two might be the same on our first short read protection iteration, but they will diverge further and further with each request. In addition, the formula seems to assume that the rows which end up tombstoned in the final result are uniformly distributed across the source partition, which can't be true for most workloads. I couldn't come up with an ideal heuristic that covers all workloads, so I stuck with something safe that respects client paging limits but still attempts to minimise the number of requests we make, by fetching (in most cases) more rows than strictly necessary (see the first sketch below). I'm not completely sure about it, so I welcome any ideas on how to make it better. Either way, anything we do should be significantly more efficient than what we have now.

I've also made some renames and refactorings, and moved a few things around, to better understand the code myself and to make it clearer for future contributors - including future me.
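To make the counting argument in points 1 and 3 concrete, here is a minimal, self-contained sketch of the retry loop. Everything in it - the class name, the over-fetch factor, the replica model - is an illustrative assumption, not the actual {{DataResolver}} code:

{code:java}
// Illustrative model only, not the real DataResolver: shows the generalised
// stop condition (previous retry returned fewer rows than requested) and
// sizes each retry by the number of rows we actually still miss.
public final class ShortReadSimulator
{
    private static int replicaRemaining = 10; // rows left at the replica (made up)

    public static void main(String[] args)
    {
        int limit = 100;      // rows the client asked for
        int counted = 80;     // rows surviving reconciliation so far
        int lastQueried = 0;  // rows requested by the previous retry
        int lastCounted = -1; // value of counted before the previous retry

        while (counted < limit)
        {
            // Generalised stop condition: the previous retry came up short, so
            // the replica has nothing more to give. The old check
            // (lastCounted == counted) only catches the 0-extra-rows case and
            // would cost us one more blocking round trip here.
            if (lastQueried > 0 && counted - lastCounted < lastQueried)
                break;

            // We really miss limit - counted rows (not n - x); over-fetch by a
            // made-up factor to cut down on round trips while still respecting
            // the client's paging limit.
            int toQuery = Math.max(1, (limit - counted) * 2);

            lastCounted = counted;
            lastQueried = toQuery;
            counted += fetchFromReplica(toQuery);
        }
        System.out.println("stopped with counted = " + counted); // prints 90
    }

    // Stand-in for the blocking single-replica read.
    private static int fetchFromReplica(int requested)
    {
        int returned = Math.min(requested, replicaRemaining);
        replicaRemaining -= returned;
        return returned;
    }
}
{code}

With the old {{lastCount == counter.counted()}} condition, this loop would issue one more blocking retry that returns zero rows before stopping; the generalised condition aborts as soon as a retry comes back short.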
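And a similarly hypothetical model of the counting-only counter from point 2 - none of these names come from the real {{DataLimits}} class - showing how a global limit exhausted by earlier partitions' retries turns {{isDoneForPartition()}} into a false positive:

{code:java}
// Minimal counting-only counter, modelled on the definition quoted above:
// isDoneForPartition() == isDone() || rowsInCurrentPartition >= perPartitionLimit.
public final class CounterModel
{
    private final int globalLimit;
    private final int perPartitionLimit;
    private int counted;
    private int countedInCurrentPartition;

    CounterModel(int globalLimit, int perPartitionLimit)
    {
        this.globalLimit = globalLimit;
        this.perPartitionLimit = perPartitionLimit;
    }

    void count(int rows) { counted += rows; countedInCurrentPartition += rows; }
    void newPartition()  { countedInCurrentPartition = 0; }

    boolean isDone() { return counted >= globalLimit; }

    boolean isDoneForPartition()
    {
        return isDone() || countedInCurrentPartition >= perPartitionLimit;
    }

    public static void main(String[] args)
    {
        CounterModel counter = new CounterModel(10, 5);
        counter.count(10);      // earlier partitions' retries consume the global limit
        counter.newPartition(); // move on to the next partition
        // The new partition has contributed 0 rows (say, it only has a
        // partition-level deletion), yet the per-partition check reports done:
        System.out.println(counter.isDoneForPartition()); // true - a false positive
    }
}
{code}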
The most significant noticeable change is shifting the application of the per-response counter to the {{mergeWithShortReadProtection()}} method, instead of overloading {{ShortReadRowProtection}} with that responsibility - I also like having it next to the global counter creation, so you can see the contrast in arguments.

> Fix short read protection logic for querying more rows
> ------------------------------------------------------
>
>                 Key: CASSANDRA-13794
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13794
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Benedict
>            Assignee: Aleksey Yeschenko
>             Fix For: 3.0.x, 3.11.x
>
>
> Discovered by [~benedict] while reviewing CASSANDRA-13747:
> {quote}
> While reviewing I got a little suspicious of the modified line {{DataResolver}}:479, as it seemed that n and x were the wrong way around... and, reading the comment of intent directly above, and reproducing the calculation, they are indeed.
> This is probably a significant enough bug that it warrants its own ticket for record keeping, though I'm fairly agnostic on that decision.
> I'm a little concerned about our current short read behaviour, as right now it seems we should be requesting exactly one row, for any size of under-read, which could mean extremely poor performance in case of large under-reads.
> I would suggest that the outer unconditional {{Math.max}} is a bad idea, has been (poorly) insulating us from this error, and that we should first be asserting that the calculation yields a value >= 0 before setting to 1.
> {quote}
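As an aside, the guard suggested at the end of the quote could look something like this minimal, hypothetical sketch (the actual calculation at {{DataResolver}}:479 is deliberately not reproduced here):

{code:java}
// Hypothetical sketch, not the real DataResolver code: assert that the
// computed under-read is non-negative before clamping, so a sign error (like
// the swapped n and x) fails loudly under -ea instead of being silently
// turned into a 1-row fetch by the unconditional Math.max.
public final class UnderReadGuard
{
    static int rowsToRequest(int computedUnderRead)
    {
        assert computedUnderRead >= 0 : "negative under-read: " + computedUnderRead;
        return Math.max(1, computedUnderRead);
    }

    public static void main(String[] args)
    {
        System.out.println(rowsToRequest(20));  // 20
        System.out.println(rowsToRequest(-20)); // AssertionError when run with -ea
    }
}
{code}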