Sylvain Lebresne created CASSANDRA-15432:
--------------------------------------------

             Summary: The "read defragmentation" optimization does not work
                 Key: CASSANDRA-15432
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
             Project: Cassandra
          Issue Type: Bug
            Reporter: Sylvain Lebresne


The so-called "read defragmentation" that has been added way back with 
CASSANDRA-2503 actually does not work, and never has. That is, the 
defragmentation writes do happen, but they only additional load on the nodes 
without helping anything, and are thus a clear negative.

The "read defragmentation" (which only impact so-called "names queries") kicks 
in when a read hits "too many" sstables (> 4 by default), and when it does, it 
writes down the result of that read. The assumption being that the next read 
for that data would only read the newly written data, which if not still in 
memtable would at least be in a single sstable, thus speeding that next read.
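
To make the mechanism concrete, here is a small toy sketch of the idea (plain Java, 
not the actual Cassandra code; the threshold constant and all names are illustrative 
only):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the trigger described above: a names read that had to merge fragments
// from "too many" sstables re-writes the merged row through the normal write path.
public class DefragTrigger
{
    static final int DEFRAG_THRESHOLD = 4; // reads touching > 4 sstables trigger it

    public static void main(String[] args)
    {
        // Each "sstable" maps column name -> write timestamp; the row's columns got
        // spread over 5 sstables by successive flushes.
        List<Map<String, Long>> sstables = new ArrayList<>();
        for (int i = 0; i < 5; i++)
        {
            Map<String, Long> sstable = new TreeMap<>();
            sstable.put("col" + i, 100L + i);
            sstables.add(sstable);
        }

        // The names read has to merge all 5 sstables to assemble the row.
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> sstable : sstables)
            merged.putAll(sstable);

        if (sstables.size() > DEFRAG_THRESHOLD)
        {
            // "Defragmentation": write the merged row back as a normal mutation. The
            // cells keep their original timestamps (100..104); nothing marks this copy
            // as superseding the fragments it was read from.
            sstables.add(new TreeMap<>(merged)); // memtable first, eventually a 6th sstable
        }

        System.out.println("sstables holding (parts of) the row: " + sstables.size()); // 6
    }
}
{code}

The important point is that last write: the merged row goes back in through the 
regular write path with the cells' original timestamps untouched, which is exactly 
what breaks the optimization, as explained below.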

Unfortunately, this is not how it works. When we defrag and write the result 
of our original read, we do so with the timestamps of the data read (as we 
should; changing the timestamps would be plain wrong). As a result, 
following reads will read that data first, but will have no way to tell that no 
more sstables need to be read. Technically, the 
[{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830]
 call will not return {{null}}, because {{currentMaxTs}} will be higher than 
at least some of the data in the result, and this until we've read from as many 
sstables as in the original read.
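
Here is a toy illustration of that last point (again plain Java, not Cassandra code; 
the timestamps and the stop condition are simplified, but they capture the 
{{currentMaxTs}} comparison):

{code:java}
// Why the defragmented copy cannot let reduceFilter short-circuit: it carries the
// original write timestamps, so any remaining sstable whose max timestamp exceeds
// one of them still has to be consulted.
public class DefragTimestampDemo
{
    public static void main(String[] args)
    {
        // The cells were originally written at these timestamps, one per sstable.
        long[] cellTimestamps = { 100, 101, 102, 103, 104 };

        // The defrag write copied the cells verbatim, so the complete row we read
        // first (from the defragmented sstable) still carries timestamps 100..104.

        // Max data timestamps of the remaining sstables (the ones holding the original
        // fragments), visited newest first.
        long[] remainingSstableMaxTs = { 104, 103, 102, 101, 100 };

        int sstablesRead = 1; // the defragmented sstable already gave us the full row
        for (long sstableMaxTs : remainingSstableMaxTs)
        {
            // reduceFilter-style check: we may stop only when nothing in the next
            // sstable could be newer than what we already have.
            boolean mustKeepReading = false;
            for (long ts : cellTimestamps)
                if (sstableMaxTs > ts)
                    mustKeepReading = true;

            if (!mustKeepReading)
                break;

            sstablesRead++;
        }

        // Prints 5: as many sstables as the original read touched, so the
        // defragmentation write bought us nothing.
        System.out.println("sstables read after defrag: " + sstablesRead);
    }
}
{code}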

I see no easy way to fix this. It might be possible to make it work with 
additional per-sstable metadata, but nothing comes to mind that is sufficiently 
simple and cheap to be worth it. I thus suggest simply removing that code.

For the record, I'll note that there is actually a 2nd problem with that code: 
currently, we "defrag" a read even if we didn't get data for everything the 
query requests. This is also "wrong" even if we ignore the first issue: a 
following read that reads the defragmented data would have no way to know not to 
read more sstables to try to get the missing parts (say, a names query for columns 
{{a}}, {{b}} and {{c}} where only {{a}} and {{b}} exist: the defragmented row cannot 
tell a later read that looking for {{c}} in other sstables is pointless). This 
problem would be fixable, but is obviously overshadowed by the previous one anyway.

Anyway, as mentioned, I suggest just removing the "optimization" (which, again, 
never optimized anything) altogether, and I'm happy to provide the simple patch.

The only question might be: which versions? This impacts all versions, but it 
isn't a correctness bug either, "just" a performance one. So do we want 4.0 
only, or is there appetite for earlier versions?



