Sylvain Lebresne created CASSANDRA-11475:
--------------------------------------------

             Summary: MV code refactor
                 Key: CASSANDRA-11475
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11475
             Project: Cassandra
          Issue Type: Bug
            Reporter: Sylvain Lebresne
            Assignee: Sylvain Lebresne
             Fix For: 3.0.x, 3.x


While working on CASSANDRA-5546 I run into a problem with TTLs on MVs, which 
looking more closely is a bug of the MV code. But one thing leading to another 
I reviewed a good portion of the MV code and found the following correction 
problems:
* If a base row is TTLed then even if an update remove that TTL the view entry 
remained TTLed and expires, leading to an inconsistency.
* Due to calling the wrong ctor for {{LivenessInfo}}, when a TTL was set on the 
base table, the view entry was living twice as long as the TTL. Again leading 
to a temporary inconsistency.
* When reading existing data to compute view updates, all deletion informations 
are completely ignored (the code uses a {{PartitionIterator}} instead of an 
{{UnfilteredPartitionIterator}}). This is a serious issue since it means some 
deletions could be totally ignored as far as views are concerned especially 
when messages are delivered to a replica out of order. I'll note that while the 
2 previous points are relatively easy to fix, I didn't find an easy and clean 
way to fix this one on the current code.

Further, I think the MV code in general has inefficiencies/code complexities 
that should be avoidable:
* {{TemporalRow.Set}} is buffering both everything read and a pretty much 
complete copy of the updates. That's a potentially high memory requirement. We 
shouldn't have to copy the updates and we shouldn't buffer all reads but rather 
work incrementally.
* {{TemporalRow}}/{{TemporalRow.Set}}/{{TemporalCell}} classes are somewhat 
re-inventing the wheel. They are really just storing both an update we're doing 
and the corresponding existing data, but we already have 
{{Row}}/{{Partition}}/{{Cell}} for that. In practice, those {{Temporal*}} class 
generates a lot of allocations that we could avoid.
* The code from CASSANDRA-10060 to avoid multiple reads of the base table with 
multiple views doesn't work when the update has partition/range tombstones 
because the code uses {{TemporalRow.Set.setTombstonedExisting()}} to trigger 
reuse, but the {{TemporalRow.Set.withNewViewPrimaryKey()}} method is used 
between view and it does not preseve the {{hasTombstonedExisting}} flag.  But 
that oversight, which is trivial to fix, is kind of a good thing since if you 
fix it, you're left with a correction problem.
  The read done when there is a partition deletion depends on the view itself 
(if there is clustering filters in particular) and so reusing that read for 
other views is wrong. Which makes that whole reuse code really dodgy imo: the 
read for existing data is in {{View.java}}, suggesting that it depends on the 
view (which again, it does at least for partition deletion), but it shouldn't 
if we're going to reuse the result across multiple views.
* Even ignoring the previous point, we still potentially read the base table 
twice if the update mix both row updates and partition/range deletions, 
potentially re-reading the same values.
* It's probably more minor but the reading code is using {{QueryPager}}, which 
is probably an artifact of the initial version of the code being pre-8099, but 
it's not necessary anymore (the reads are local and locally we're completely 
iterator based), adding, especially when we do page. I'll note that despite 
using paging, the current code still buffers everything in {{TemporalRow.Set}} 
anyway .

Overall, I suspect trying to fix the problems above (particularly the fact that 
existing deletion infos are ignored) is only going to add complexity with the 
current code and we'd still have to fix the inefficiencies. So I propose a 
refactor of that code which does 2 main things:
# it removes all of {{TemporalRow}} and related classes. Instead, it directly 
uses the existing {{Row}} (with all its deletion infos) and the update being 
applied to it and compute the updates for the view from that. I submit that 
this is more clear/simple, but this also avoid copying every cell of both the 
existing and update data as a {{TemporalCell}}. We can also reuse codes like 
{{Rows.merge}} and {{Rows.diff}} to make the handling of deletions relatively 
painless.
# instead of dealing with each view one at a time, re-iterating over all 
updates each time, it iterates over each individual updates once and deal with 
each view for that update. This makes it more clear that the reads has to care 
about every view involved, but more importantly allow to deal with the read 
data incrementally, never buffering it all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to