[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-15432:
-----------------------------------------
    Test and Documentation Plan: No impact on testing, as this removes code and no test existed for the removed optimization. Afaict, the optimization was not documented, so there is no impact on documentation.
                         Status: Patch Available  (was: Open)

> The "read defragmentation" optimization does not work
> -----------------------------------------------------
>
>                 Key: CASSANDRA-15432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Normal
>
> The so-called "read defragmentation" that was added way back with
> CASSANDRA-2503 actually does not work, and never has. That is, the
> defragmentation writes do happen, but they only add load on the nodes
> without helping anything, and are thus a clear negative.
>
> The "read defragmentation" (which only impacts so-called "names queries")
> kicks in when a read hits "too many" sstables (> 4 by default), and when it
> does, it writes down the result of that read. The assumption is that the
> next read for that data would only read the newly written data, which, if not
> still in the memtable, would at least be in a single sstable, thus speeding up
> that next read.
>
> Unfortunately, this is not how this works. When we defrag and write the result
> of our original read, we do so with the timestamp of the data read (as we
> should; changing the timestamp would be plain wrong). As a result,
> following reads will read that data first, but will have no way to tell that
> no more sstables should be read.
> Technically, the
> [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830]
> call will not return {{null}} because the {{currentMaxTs}} will be higher
> than at least some of the data in the result, and this holds until we've read
> from as many sstables as in the original read.
>
> I see no easy way to fix this. It might be possible to make it work with
> additional per-sstable metadata, but nothing sufficiently simple and cheap to
> be worth it comes to mind. I thus suggest simply removing that code.
>
> For the record, I'll note that there is actually a 2nd problem with that
> code: currently, we "defrag" a read even if we didn't get data for everything
> that the query requests. This is also "wrong" even if we ignore the first
> issue: a following read that reads the defragmented data would also have
> no way to know not to read more sstables to try to get the missing parts.
> This problem would be fixable, but is obviously overshadowed by the previous
> one anyway.
>
> Anyway, as mentioned, I suggest just removing the "optimization" (which,
> again, never optimized anything) altogether, and I'm happy to provide the
> simple patch.
>
> The only question might be in which versions? This impacts all versions, but
> this isn't a correctness bug either, "just" a performance one. So do we want
> 4.0 only, or is there appetite for earlier?
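
To illustrate the first problem described in the issue, here is a minimal, self-contained Java sketch. It is a toy model and not Cassandra code: the class DefragSketch, the helper sstablesRead, and the representation of an sstable as a map from column name to cell timestamp are all hypothetical. It only assumes that sstables are visited from the most recent max timestamp to the oldest, and that a requested column stops being looked for only once the result holds a strictly newer cell for it than the max timestamp of the next sstable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: an "sstable" is a non-empty map from column name to the
// write timestamp of the cell it holds (values are omitted for brevity).
final class DefragSketch {

    static long maxTimestamp(Map<String, Long> sstable) {
        return Collections.max(sstable.values());
    }

    // Returns how many sstables a names query for `requested` has to read.
    // If `merged` is non-null, the merged result is copied into it.
    static int sstablesRead(List<Map<String, Long>> sstables,
                            Set<String> requested,
                            Map<String, Long> merged) {
        List<Map<String, Long>> ordered = new ArrayList<>(sstables);
        // Most recent max timestamp first; the sort is stable on ties.
        ordered.sort((a, b) -> Long.compare(maxTimestamp(b), maxTimestamp(a)));

        Map<String, Long> result = new HashMap<>();
        int read = 0;
        for (Map<String, Long> sstable : ordered) {
            long currentMaxTs = maxTimestamp(sstable);
            // A column is still needed unless we already hold a cell for it
            // that is strictly newer than anything this sstable can contain.
            boolean stillNeeded = false;
            for (String column : requested) {
                Long ts = result.get(column);
                if (ts == null || ts <= currentMaxTs) {
                    stillNeeded = true;
                    break;
                }
            }
            if (!stillNeeded) {
                break; // nothing left to look for: stop reading sstables
            }
            read++;
            for (String column : requested) {
                Long ts = sstable.get(column);
                if (ts != null) {
                    result.merge(column, ts, Long::max); // keep the newest cell
                }
            }
        }
        if (merged != null) {
            merged.putAll(result);
        }
        return read;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> sstables = new ArrayList<>();
        Set<String> requested = new HashSet<>();
        // Five sstables, each holding one requested column written at a different time.
        for (long i = 1; i <= 5; i++) {
            sstables.add(Map.of("c" + i, i * 10));
            requested.add("c" + i);
        }
        Map<String, Long> merged = new HashMap<>();
        int before = sstablesRead(sstables, requested, merged);

        // "Defragment": write the merged result back as a new sstable,
        // keeping the original timestamps of the data.
        sstables.add(merged);
        int after = sstablesRead(sstables, requested, null);

        System.out.println("sstables read before defrag: " + before);
        System.out.println("sstables read after defrag:  " + after);
    }
}
```

In this toy model the first read touches five sstables and the read after the defragmentation write touches six: the defragmented copy carries the same timestamps as the data it was built from, so it can never rule out the older sstables, and the read still visits each of the original ones.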