[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

Sam Tunnicliffe (JIRA) Sat, 25 Jul 2015 03:39:57 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641503#comment-14641503
 ]


Sam Tunnicliffe commented on CASSANDRA-9459:
--------------------------------------------


I've pushed a branch [here|https://github.com/beobal/cassandra/tree/9459-wip] 
with some of the proposed api changes for this ticket. 
This is a fairly large patch, so I'll try to summarise the main changes below, 
but the key places to look at are in the {{org.apache.cassandra.index}} 
package, in particular.

* {{o.a.c.index.Index}}
* {{o.a.c.index.SecondaryIndexManager}}
* {{o.a.c.index.internal.CassandraIndexer}}

This patch is most definitely a work in progress, but I'd appreciate some 
feedback, especially on the general approach and high level API changes. 
[~sbtourist], [~adelapena] & [~xedin] in particular, I know you are likely have 
opinions on this, which would be good to hear.


h3. Flattened class hierarchy

Instead of:

{noformat}
                    SecondaryIndex
           _______________|________________
          |                                |
  PerRowSecondaryIndex           PerColumnSecondaryIndex
                                           |
                          AbstractSimplePerColumnSecondaryIndex
                                    _______|_______
                                   |               | 
                                KeysIndex    CompositesIndex
                               ____________________|___________________
                              |                    |                   |
                       CompositesIndexOnX   CompositesIndexOnY  
CompositesIndexOnZ 
{noformat}

We just have a single {{Index}} interface, with 2 inner interfaces {{Indexer}} 
and {{Searcher}}.
The specific differences between indexes on different types of columns in 
composite tables (i.e. all the {{CompositesIndexOnX}} implementations) have 
been abstracted into a set of stateless functions, defined in the 
{{ColumnIndexFunctions}} interface & with implementations for use with the 
various column types. As such, there is now just single {{Index}} 
implementation for all built-in indexes, {{CassandraIndex}} (I'm not sold on 
this name, but it follows precedent set by {{CassandraAuthorizer}} and 
{{CassandraRoleManager}}). 
A nice side effect is that {{KEYS}} indexes (for thrift/compact tables and, in 
CASSANDRA-8103, static column indexes) also fit into this pattern, so no need 
for another specialisation there. There are still separate searcher 
implementations for {{KEYS}} and {{COMPOSITES}} indexes, but there's a lot more 
commonality between them now (not as a result of this patch, that's an artifact 
of CASSANDRA-8099).

h3. Event driven, partition scoped updates

Instead of delivering updates to an index implementation per-partition (as 
previously with PRSI) or per-cell (PSCI), the write component of the index api 
is more closely aligned to a partition update of the underlying base data.

More specifically, when a partition is updated (either via a regular write, or 
during compaction) a series of events are (or may be) fired. An {{Index}} 
implementation is required to provide an event listener, whose interface is 
defined in {{Index.Indexer}}, to handle these events. The granularity of these 
events maps to a PartitionUpdate, so there are events that are fired on 
* partition delete
* range tombstone
* row inserted
* row updated
* row removed 

h3. Caveats/Missing/TBD/etc

* A major thing missing in this branch is CASSANDRA-7771 (multiple indexes 
per-column). Along with that, the plan is also to introduce true per-row 
indexes, where the index is not necessarily linked to *any* specific column. So 
until we start hashing that out a bit better, the way SIM represents the 
collection of Indexes is tbd.
* Related to that, once we've settled on how to define an Index's relationship 
with a Row (moving that out of ColumnDefinition), we can revisit caching & 
lookup optimisation in SIM. Right now, every time we look up an index we do and 
filter of all the registered indexes for the table. We can definitely improve 
this and will do so ASAP.
* The mechanism by which we select indexes at query time remains pretty 
restrictive. The query clauses being represented as a list of 
{{RowFilter.Expression}} means only AND conjunctions are supported. This limits 
the scope for query optimisation and makes it difficult to extend search 
capabilities in the future, like adding support for OR for example. I'd like to 
move to something more expressive to give us scope to improve this area in 
future tickets.
* The validation methods on Index need some work. Basically these were simply 
copied from the existing implementation, but they ought to be reworked to 
combine them into a single {{validate(partition_update)}} or at least into 
{{validate(partitionkey)}} and {{validate(row)}}.
* The index transaction classes in SIM (formerly the {{Updater}}), need to move 
to a more row-centric interface. To a degree this was waiting on 
CASSANDRA-9705, and so it's easier to do now that has been committed but there 
still some work to do on it. As [~slebresne] put it, we should "separate 
merging the rows from diffing the merge result to the merge inputs (to generate 
the proper index updates)". The goal is that {{IndexTransaction}} and 
{{CleanupTransaction}} end up with methods along the lines of 
{{onInserted(row)}} and {{onUpdated(oldRow, newRow, mergedRow)}}. 
* Also on those {{&#123;Index,Cleanup&#125;Transaction}} classes: 
implementation-wise, they should probably behave more inline with their naming 
and only push updates to the actual Indexers on commit. This would enable 
{{CleanupTransaction}} to perform its updates asynchronously (which would be 
fine for trimming stale entries during compaction). 
* Somewhat related to the above point is that when we update indexes during 
compaction, we're now holding open the OpOrder for far too long. Batching these 
updates and performing them async will help there, but CASSANDRA-9832 is the 
better solution. That's currently waiting on me for review, after which I'll 
roll it in here.
* This initial patch doesn't include any new tests. I think the new stuff is 
more easily unit-testable, but I've been concentrating on making sure the 
coarser grained, existing tests pass before getting into that.
* Some of the new stuff here makes fairly extensive use of streams and lambdas. 
Obviously there's some work to be done on assessing the performance 
implications of that, but I don't think that should hold up review of the main 
API changes. Everything using the new Java8 features can obviously be easily 
re-implemented using pre-8 methods, should that prove necessary.
* I haven't verified it properly yet, but I believe CASSANDRA-8717 is/was 
broken by CASSANDRA-8099. The post reconcilliation processing step is still 
there, but it looks like the code for scanning all ranges was removed from 
StorageProxy. I'll check on this and make sure we don't lose this functionality.

> SecondaryIndex API redesign
> ---------------------------
>
>                 Key: CASSANDRA-9459
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.0 beta 1
>
>
> For some time now the index subsystem has been a pain point and in large part 
> this is due to the way that the APIs and principal classes have grown 
> organically over the years. It would be a good idea to conduct a wholesale 
> review of the area and see if we can come up with something a bit more 
> coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
> could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
> is done in CASSANDRA-8099).
> * SecondayIndexManager is overly complex and several of its functions should 
> be simplified/re-examined. The handling of which columns are indexed and 
> index selection on both the read and write paths are somewhat dense and 
> unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some 
> serious rework.
> There are a number of outstanding tickets which we should be able to roll 
> into this higher level one as subtasks (but I'll defer doing that until 
> getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of 
> both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
> storage engine changes may facilitate a friendlier index API, but also 
> because of the changes to SIS mentioned above. As for 6717, the changes to 
> schema tables there will help facilitate CASSANDRA-7771.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

Reply via email to