Sylvain Lebresne created CASSANDRA-15432:
--------------------------------------------

             Summary: The "read defragmentation" optimization does not work
                 Key: CASSANDRA-15432
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
             Project: Cassandra
          Issue Type: Bug
            Reporter: Sylvain Lebresne


The so-called "read defragmentation" that has been added way back with 
CASSANDRA-2503 actually does not work, and never has. That is, the 
defragmentation writes do happen, but they only additional load on the nodes 
without helping anything, and are thus a clear negative.

The "read defragmentation" (which only impact so-called "names queries") kicks 
in when a read hits "too many" sstables (> 4 by default), and when it does, it 
writes down the result of that read. The assumption being that the next read 
for that data would only read the newly written data, which if not still in 
memtable would at least be in a single sstable, thus speeding that next read.
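
To make the mechanism concrete, here is a small toy sketch of the idea (plain Java, 
not the actual Cassandra code; the threshold constant and all names are illustrative 
only):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the trigger described above: a names read that had to merge fragments
// from "too many" sstables re-writes the merged row through the normal write path.
public class DefragTrigger
{
    static final int DEFRAG_THRESHOLD = 4; // reads touching > 4 sstables trigger it

    public static void main(String[] args)
    {
        // Each "sstable" maps column name -> write timestamp; the row's columns got
        // spread over 5 sstables by successive flushes.
        List<Map<String, Long>> sstables = new ArrayList<>();
        for (int i = 0; i < 5; i++)
        {
            Map<String, Long> sstable = new TreeMap<>();
            sstable.put("col" + i, 100L + i);
            sstables.add(sstable);
        }

        // The names read has to merge all 5 sstables to assemble the row.
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> sstable : sstables)
            merged.putAll(sstable);

        if (sstables.size() > DEFRAG_THRESHOLD)
        {
            // "Defragmentation": write the merged row back as a normal mutation. The
            // cells keep their original timestamps (100..104); nothing marks this copy
            // as superseding the fragments it was read from.
            sstables.add(new TreeMap<>(merged)); // memtable first, eventually a 6th sstable
        }

        System.out.println("sstables holding (parts of) the row: " + sstables.size()); // 6
    }
}
{code}

The important point is that last write: the merged row goes back in through the 
regular write path with the cells' original timestamps untouched, which is exactly 
what breaks the optimization, as explained below.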

Unfortunately, this is not how it works. When we defrag and write the result 
of our original read, we do so with the timestamps of the data read (as we 
should; changing the timestamps would be plain wrong). As a result, 
following reads will read that data first, but will have no way to tell that no 
more sstables need to be read. Technically, the 
[{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830]
 call will not return {{null}}, because {{currentMaxTs}} will be higher than 
at least some of the data in the result, and this until we've read from as many 
sstables as in the original read.
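
Here is a toy illustration of that last point (again plain Java, not Cassandra code; 
the timestamps and the stop condition are simplified, but they capture the 
{{currentMaxTs}} comparison):

{code:java}
// Why the defragmented copy cannot let reduceFilter short-circuit: it carries the
// original write timestamps, so any remaining sstable whose max timestamp exceeds
// one of them still has to be consulted.
public class DefragTimestampDemo
{
    public static void main(String[] args)
    {
        // The cells were originally written at these timestamps, one per sstable.
        long[] cellTimestamps = { 100, 101, 102, 103, 104 };

        // The defrag write copied the cells verbatim, so the complete row we read
        // first (from the defragmented sstable) still carries timestamps 100..104.

        // Max data timestamps of the remaining sstables (the ones holding the original
        // fragments), visited newest first.
        long[] remainingSstableMaxTs = { 104, 103, 102, 101, 100 };

        int sstablesRead = 1; // the defragmented sstable already gave us the full row
        for (long sstableMaxTs : remainingSstableMaxTs)
        {
            // reduceFilter-style check: we may stop only when nothing in the next
            // sstable could be newer than what we already have.
            boolean mustKeepReading = false;
            for (long ts : cellTimestamps)
                if (sstableMaxTs > ts)
                    mustKeepReading = true;

            if (!mustKeepReading)
                break;

            sstablesRead++;
        }

        // Prints 5: as many sstables as the original read touched, so the
        // defragmentation write bought us nothing.
        System.out.println("sstables read after defrag: " + sstablesRead);
    }
}
{code}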

I see no easy way to fix this. It might be possible to make it work with 
additional per-sstable metadata, but nothing comes to mind that is sufficiently 
simple and cheap to be worth it. I thus suggest simply removing that code.

For the record, I'll note that there is actually a 2nd problem with that code: 
currently, we "defrag" a read even if we didn't get data for everything the 
query requests. This is also "wrong" even if we ignore the first issue: a 
following read that reads the defragmented data would have no way to know not to 
read more sstables to try to get the missing parts (say, a names query for columns 
{{a}}, {{b}} and {{c}} where only {{a}} and {{b}} exist: the defragmented row cannot 
tell a later read that looking for {{c}} in other sstables is pointless). This 
problem would be fixable, but is obviously overshadowed by the previous one anyway.

Anyway, as mentioned, I suggest just removing the "optimization" (which, again, 
never optimized anything) altogether, and I'm happy to provide the simple patch.

The only question might be: which versions? This impacts all versions, but it 
isn't a correctness bug either, "just" a performance one. So do we want 4.0 
only, or is there appetite for earlier versions?



