[ 
https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583530#comment-14583530
 ] 

Sylvain Lebresne commented on CASSANDRA-8099:
---------------------------------------------

A quick update on this. I've pushed a rebased (and squashed, because that made 
it a *lot* easier to rebase) version to [my 8099 
branch|https://github.com/pcmanus/cassandra/tree/8099].

It's still missing wire backward compatibility ([~thobbs] is finishing this, so 
it should hopefully be ready soon). Regarding tests:
* unit tests are almost green: mostly, some failures remain in the hadoop 
tests. I could actually use the experience of someone who knows these tests 
and the code involved, as it's not immediately clear to me what this code is 
even doing.
* dtests still have a fair number of failures, but I've only looked at them 
recently and the count is going down quickly.

h2. OpOrder

I think the main problem was that a local read (done through 
{{SP.LocalReadRunnable}}) was potentially keeping a group "open" while waiting 
on other nodes. I also realized this path meant local reads (the actual read of 
sstables) were done outside of the {{StorageProxy}} methods, and so 1) not on 
the thread they were supposed to be on and 2) outside of the "timeout" check. I 
changed this so that a local response actually materializes everything upfront 
(similarly to what we do today), solving the problems above. This is not 
perfect and I'm sure we'll improve on this in the future, but it feels like a 
good enough option initially.
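To illustrate the shape of the fix, here's a minimal, hypothetical Java sketch. 
The {{Group}} class below is a toy stand-in, not Cassandra's actual 
{{OpOrder.Group}}: the point is only that the eager variant drains its row 
iterator while the guard is open and closes it before anything waits on other 
nodes, instead of holding the group open for the lifetime of a lazy iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class EagerLocalRead {
    /** Toy resource standing in for an OpOrder.Group (hypothetical). */
    static class Group implements AutoCloseable {
        boolean open = true;
        @Override public void close() { open = false; }
    }

    /**
     * Eager variant: materialize all rows while the group is open, so the
     * group is closed before the response waits on other nodes.
     */
    static List<String> readLocally(Iterator<String> rows) {
        List<String> materialized = new ArrayList<>();
        try (Group group = new Group()) {
            while (rows.hasNext())
                materialized.add(rows.next()); // all sstable reads happen in scope
        } // group closed here, before any cross-node waiting
        return materialized;
    }

    public static void main(String[] args) {
        System.out.println(readLocally(List.of("r1", "r2").iterator()));
    }
}
```

The alternative (returning the iterator itself and closing the group in its 
{{close}}) is exactly what kept the group open while waiting on other nodes.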

Regarding moving {{OpOrder}} out of {{close}}, the only way I can see to do 
that is to move it up the stack, into {{ReadCommandVerbHandler}} and 
{{SP.LocalReadRunnable}} (as suggested by Branimir above). I'm working on that 
(I just started and might not have the time to finish today, but it'll be done 
early Monday for sure).

h2. Branimir's review remarks

I've integrated fixes for most of the remarks. I discuss the rest below.

bq. [CompactionIterable 
125|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L125]:
 I doubt index update belongs here, as side effect of iteration. Ideally index 
should be collected, not updated.

Though I don't disagree in principle, this is no different from how it's done 
currently (it's done the same way in {{LazilyCompactedRow}}; it just happens 
that the old {{LazilyCompactedRow}} has been merged into {{CompactionIterable}} 
(now {{CompactionIterator}}) because simplifications from the patch made 
separate classes unnecessary). Happy to look at cleaning this up in a separate 
ticket however (in practice it probably belongs with cleaning up the 2ndary 
index API).

bq. [CompactionIterable 
237|https://github.com/pcmanus/cassandra/blob/75b98620e30b5df31431618cc21e090481f33967/src/java/org/apache/cassandra/db/compaction/CompactionIterable.java#L237]:
 Another side effect of iteration that preferably should be handled by writer.

Maybe, but it's not that simple. Merging (which is done directly by the 
{{CompactionIterator}}) gets rid of empty partitions, and more generally we get 
rid of them as soon as possible. I think that's the right thing to do, as it's 
easier for the rest of the code, but it means we have to do invalidation in 
{{CompactionIterator}}. Of course, we could special-case 
{{CompactionIterator}} to not remove empty partitions and do cache invalidation 
externally, but I'm not sure that would be cleaner overall (it would be 
somewhat more error-prone). Besides, I could argue that cache invalidation is 
very much a product of compaction, and having it in {{CompactionIterator}} is 
not that bad.
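To make the trade-off concrete, here is a hedged Java sketch of a merge-time 
iterator that purges empty partitions and performs cache invalidation as a 
side effect of iteration. The {{Partition}} record and the callback are toy 
stand-ins, not Cassandra's actual types: the point is that once the iterator 
swallows emptied partitions, only it knows which keys to invalidate.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Consumer;

public class PurgingIterator implements Iterator<PurgingIterator.Partition> {
    /** Toy partition model: a key plus its merged rows (hypothetical). */
    record Partition(String key, List<String> rows) {
        boolean isEmpty() { return rows.isEmpty(); }
    }

    private final Iterator<Partition> merged;
    private final Consumer<String> invalidateCache; // side effect of iteration
    private Partition next;

    PurgingIterator(Iterator<Partition> merged, Consumer<String> invalidateCache) {
        this.merged = merged;
        this.invalidateCache = invalidateCache;
    }

    @Override public boolean hasNext() {
        while (next == null && merged.hasNext()) {
            Partition p = merged.next();
            if (p.isEmpty())
                invalidateCache.accept(p.key()); // purge emptied keys from the cache
            else
                next = p;
        }
        return next != null;
    }

    @Override public Partition next() {
        if (!hasNext()) throw new NoSuchElementException();
        Partition p = next;
        next = null;
        return p;
    }
}
```

Doing the invalidation externally would require the merge to emit the empty 
partitions it currently drops, just so a downstream consumer could observe and 
discard them again.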

bq. Validation compaction now uses CompactionIterable and thus has side effects 
(index & cache removal).

I've fixed that, but I'll note for posterity that, as far as I can tell, index 
removal is done for validation compaction on trunk (and all previous versions) 
due to the use of {{LazilyCompactedRow}}. I've still disabled it (for anything 
that isn't a true compaction) because I think that's the right thing to do, 
but that is a behavior change introduced by this ticket.

bq. add that there is never content between two corresponding tombstone markers 
on any iterator.

That's mentioned in "Dealing with tombstones and shadowed cells". More 
precisely, that's what "it's part of the contract of an AtomIterator that it 
must not shadow its own data" means. But I need to clean up/update the guide, 
so I'll make sure to clarify this further while at it.
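As a toy illustration of that contract (plain string tokens standing in for 
Cassandra's actual atom types, so the names below are purely hypothetical), a 
checker only has to verify that no row content appears while a range tombstone 
opened by the same iterator is still unclosed:

```java
import java.util.List;

public class ShadowCheck {
    /**
     * Returns true iff no "ROW" token appears between an "OPEN" marker and
     * its corresponding "CLOSE" marker, i.e. the stream never shadows its
     * own data. Tokens are an illustrative simplification.
     */
    static boolean respectsContract(List<String> atoms) {
        int openMarkers = 0; // depth of currently open range tombstones
        for (String atom : atoms) {
            switch (atom) {
                case "OPEN":  openMarkers++; break;
                case "CLOSE": openMarkers--; break;
                case "ROW":
                    if (openMarkers > 0)
                        return false; // content shadowed by its own tombstone
                    break;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(respectsContract(List.of("ROW", "OPEN", "CLOSE", "ROW"))); // true
        System.out.println(respectsContract(List.of("OPEN", "ROW", "CLOSE")));        // false
    }
}
```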

bq. These objects should be Iterable instead. Having that would give clear 
separation between the iteration process and the entity-level data

Yes, it would be cleaner from that standpoint. And the use of iterators in the 
first place is indeed largely carried over from the existing code; I just 
hadn't really thought of the alternative, tbh. I'll try to check next week how 
easy such a change is. That said, I'm not sure the direct use of iterators is 
that confusing either, so if it turns out hairy, I don't think it's worth 
blocking on this (that is, we can very well change it later).
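For reference, the suggested separation would look roughly like the hedged 
sketch below ({{PartitionData}} and its fields are hypothetical names, not 
classes from the branch): entity-level data lives on the object, while 
{{iterator()}} hands out the traversal.

```java
import java.util.Iterator;
import java.util.List;

public class PartitionData implements Iterable<String> {
    private final String partitionKey; // entity-level data, outside iteration
    private final List<String> rows;

    PartitionData(String partitionKey, List<String> rows) {
        this.partitionKey = partitionKey;
        this.rows = rows;
    }

    public String partitionKey() { return partitionKey; }

    /** Each call starts a fresh traversal of the rows. */
    @Override public Iterator<String> iterator() { return rows.iterator(); }

    public static void main(String[] args) {
        PartitionData p = new PartitionData("k1", List.of("r1", "r2"));
        for (String row : p) // iteration process, separate from entity data
            System.out.println(p.partitionKey() + ":" + row);
    }
}
```

One caveat with {{Iterable}} is that it implicitly promises repeatable 
traversal, which a single-pass, disk-backed stream can't honor, so any 
iterable-based design would likely need to document (or enforce) 
one-shot iteration.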

> Refactor and modernize the storage engine
> -----------------------------------------
>
>                 Key: CASSANDRA-8099
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0 beta 1
>
>         Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the 
> code implementing the read/write path") is suffering from old age. One of the 
> main problems is that the only structure it deals with is the cell, which 
> completely ignores the higher-level CQL structure that groups cells into 
> (CQL) rows.
> This leads to many inefficiencies, like the fact that during a read we have 
> to group cells multiple times (to count on the replica, then to count on the 
> coordinator, then to produce the CQL result set) because we forget about the 
> grouping right away each time (so lots of useless cell name comparisons in 
> particular). But inefficiencies aside, having to manually recreate the CQL 
> structure every time we need it for something is hindering new features and 
> makes the code more complex than it should be.
> Said storage engine also has tons of technical debt. To pick an example, the 
> fact that during range queries we update {{SliceQueryFilter.count}} is pretty 
> hacky and error-prone. Or the overly complex contortions {{AbstractQueryPager}} 
> has to go through to simply "remove the last query result".
> So I want to bite the bullet and modernize this storage engine. I propose to 
> do 2 main things:
> # Make the storage engine more aware of the CQL structure. In practice, 
> instead of having partitions be a simple iterable map of cells, they should be 
> an iterable list of rows (each being itself composed of per-column cells, 
> though obviously not exactly the same kind of cell we have today).
> # Make the engine more iterative. What I mean here is that in the read path, 
> we end up reading all cells in memory (we put them in a ColumnFamily object), 
> but there is really no reason to. If instead we were working with iterators 
> all the way through, we could get to a point where we're basically 
> transferring data from disk to the network, and we should be able to reduce 
> GC substantially.
> Please note that such a refactor should provide some performance improvements 
> right off the bat, but that's not its primary goal either. Its primary goal is 
> to simplify the storage engine and add abstractions that are better suited to 
> further optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
