[ https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706745#comment-14706745 ]
Sam Tunnicliffe commented on CASSANDRA-9459: -------------------------------------------- Thanks [~slebresne], I appreciate it's a bit of a slog. I've pushed fixes that address most of your first set of points (too many to enumerate so I'll only mention the ones which I didn't just fix, but there's basically a new commit per-point in the branch if you want to verify any in particular). bq. {{getBestIndexFor}} seems to now favor indexes that handle more of the expressions. I'm not convinced by that heuristic The metric isn't so much how many of the original expressions can an Index handle, but the amount of filtering left to do on the results of the index lookup. It just so happens that all our current (built in) index implementations only reduce any filter to the degree of removing 1 expression (at most). In it's current incarnation {{RowFilter}} is limited to describing only a very simple expression tree, one with depth of 1 and containing only AND relations. As an effort to future-proof the Index interface somewhat I thought transforming the original filter into one representing what the index *can't* do would be a decent heuristic. The naive comparison in {{SIM}} which evaluates the reduced filters by number of expressions could safely be changed later without breaking custom index implementations (at least, that was the intention). The merits of attempting to future proof vs YAGNI are of course debatable, so if we agree this needs more consideration then I'll happily revert the selection criteria to simply consider {{estimateResultRows}}. bq. The initialization of an {{Index}} bothers me a bit bq. As for registration, index can do it directly in the ctor by calling {{baseCfs.indexManager.register()}} (we can also specify they have to do it). I'm not a fan of enforcing that kind of thing just by specification. Also, IoC is more explicit, cleaner and reduces dependencies. I *would* also argue that it improves testability but as I'm not actually testing this anywhere, I won't. Either way, I'm fine with requiring a constructor with a specific signature & removing the {{init}} method, but I'd prefer to keep {{register}} separate if you don't object too strongly. Aside from that, I agree with all your points regarding construction/init/reload etc. On caveat is that {{SIM#reload}} can't just reload the indexes it knows about already, it has to check that every index defined in its base CFS's metadata is present. In the schema update from executing {{CreateIndexStatement}}, only the {{CFMetaData.indexes}} is updated, not the {{ColumnFamilyStore}} itself, which is just reloaded. When that reload happens then, the {{SIM}} needs to add the new index which is present in {{CFM.indexes}}, but not already registered. With that exception, I've made those changes in {{454bba7708018f872df825103aa52a99c8f653bd}}. bq. Not a fan of using {{Optional}} as return type of {{Index.getReducedFilter}} as I would have expected intuitively that an empty optional would means the whole filter is reduced As noted above, this whole approach to selection is up for debate so it may be that this simply becomes not a problem. That said, I would argue against a couple of your points here; I don't agree that an empty optional intuitively implies the total reduction of the filter, in that case an empty filter, not an empty optional, would seem most semantically correct to me. Secondly, it feels fragile to make assumptions about object equality in this way, especially in an extension point like this. I would rather not depend on documentation to enforce this sort of thing. Prior art for doing this *kind* of thing in C* is to return null when there's caller cannot satisfy a request and so using {{Optional}} instead seems pretty reasonable to me. bq. I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd have prefer something like {{Index.getUnhandledExpressions}} Again, the point of the method isn't necessarily to identify which of *original* expressions are unhandled, isn't it conceivable that a custom index could radically transform a filter into one containing an entirely disjoint set of expressions from the original? bq. In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and {{removeRow}} seems dangerous to me..if insertRow() is called with some tombstone, it will insert the cells instead of removing them for instance{{insertRow}} and {{updateRow}}. Ok that's a good point, but can you clarify do you mean regular the row may contain regular tombstones (i.e. the result of "DELETE col FROM table WHERE ..."?) If that is the case, then yes there is a bug currently in that we will try to insert an index entry for the (empty) value of the tombstone. I've fixed that (in {{4d7fa1b83a92f4f5067af5ac1693ebef48135ea0}}, but I don't quite get what you mean by "it will insert the cells instead of removing them for instance", as the tombstone has no value there's nothing we can remove - in {{updateRow}} we can remove the value from the existing row ofc. I also think we {{removeRow}} is still needed for cleanup & compaction bq. Why do we care about {{IndexMetadata.TargetType}}? The main driver for {{IndexMetadata.TargetType}} was to make explicit the difference between indexes with dynamic and static sets of target columns. We could infer that an index greedily processes all partition updates if its target columns is an empty set, but we discussed this in CASSANDRA-6717 and decided that explicitly indicating that was better (at least from a {{system_schema}} perspective). bq. The fact that we have both {{IndexMetadata.TargetType}} and {{IndexTarget.TargetType}} is terribly confusing It's pretty lame, but the best I could come up with was to rename {{IndexTarget.TargetType}} to {{IndexTarget.Type}}, it's only used within {{IndexTarget}} itself an {{CreateIndexStatement}} so maybe the extra info in the name is redundant. bq. In {{AlterTableStatement.java}}, I'd rather make the {{if (index.columns.size() == 1)}} an assertion: we _will_ have to change this code once we support index on multiple columns (either by erroring out, or by actually dropping the index anyway), so I'd prefer having the code crash loudly if we forgot to update it rather than silently doing the wrong thing. Good call, I almost did this in the first place, fixed. bq. In {{Keyspace.getValidColumnFamilies}}, it would make more sense to me to have a {{SecondaryIndexManager.getIndexByName(String indexName)}} method and use that (as a side note, it's weird that the method doesn't actually use the 'idxName' variable. Do you know what gives?). Yes, that's something of a legacy from earlier versions before CASSANDRA-6717 added {{IndexMetadata}} (at that point I'd completely removed the {{SIM.indexes}} map and missed here when adding it back. I'm not sure what you mean about 'idxName', looks like it's used to me (though reading this method was making my head hurt and there was a bug in it, so I've pushed a cleaned up version). bq. Is there a reason for {{IndexRegistry}}? It doesn't hurt, but it doesn't seem of any concrete use to me (maybe it's for custom indexes, but even then I'm not really sure what it allows concretely). Mentioned above re:IoC, if we agree that this is a reasonable approach, abstracting the registry-ness from {{SIM}} makes it much easier to use a lightweight implementation for tests etc (argument slightly undermined by not actually doing this anywhere). bq. Your IDE seems to expand {{.*}} imports. arrrgh, my settings had been lost and out of force of habit I'd been "fixing" the imports (or so I thought) every time I made a change. I'll make sure I do a final pass through all the modified files before this gets committed. I'm nearly done with the points from your second comment, so I'll update again shortly > SecondaryIndex API redesign > --------------------------- > > Key: CASSANDRA-9459 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9459 > Project: Cassandra > Issue Type: Improvement > Reporter: Sam Tunnicliffe > Assignee: Sam Tunnicliffe > Fix For: 3.0 beta 1 > > > For some time now the index subsystem has been a pain point and in large part > this is due to the way that the APIs and principal classes have grown > organically over the years. It would be a good idea to conduct a wholesale > review of the area and see if we can come up with something a bit more > coherent. > A few starting points: > * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which > could be pulled up into SecondaryIndexSearcher (note that to an extent, this > is done in CASSANDRA-8099). > * SecondayIndexManager is overly complex and several of its functions should > be simplified/re-examined. The handling of which columns are indexed and > index selection on both the read and write paths are somewhat dense and > unintuitive. > * The SecondaryIndex class hierarchy is rather convoluted and could use some > serious rework. > There are a number of outstanding tickets which we should be able to roll > into this higher level one as subtasks (but I'll defer doing that until > getting into the details of the redesign): > * CASSANDRA-7771 > * CASSANDRA-8103 > * CASSANDRA-9041 > * CASSANDRA-4458 > * CASSANDRA-8505 > Whilst they're not hard dependencies, I propose that this be done on top of > both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the > storage engine changes may facilitate a friendlier index API, but also > because of the changes to SIS mentioned above. As for 6717, the changes to > schema tables there will help facilitate CASSANDRA-7771. -- This message was sent by Atlassian JIRA (v6.3.4#6332)