[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

Sam Tunnicliffe (JIRA) Fri, 21 Aug 2015 06:45:13 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706745#comment-14706745
 ]


Sam Tunnicliffe commented on CASSANDRA-9459:
--------------------------------------------

Thanks [~slebresne], I appreciate it's a bit of a slog. I've pushed fixes that 
address most of your first set of points (too many to enumerate so I'll only 
mention the ones which I didn't just fix, but there's basically a new commit 
per-point in the branch if you want to verify any in particular).
                
bq. {{getBestIndexFor}} seems to now favor indexes that handle more of the 
expressions. I'm not convinced by that heuristic

The metric isn't so much how many of the original expressions can an Index 
handle, but the amount of filtering left to do on the results of the index 
lookup. It just so happens that all our current (built in) index 
implementations only reduce any filter to the degree of removing 1 expression 
(at most). In it's current incarnation {{RowFilter}} is limited to describing 
only a very simple expression tree, one with depth of 1 and containing only AND 
relations. As an effort to future-proof the Index interface somewhat I thought 
transforming the original filter into one representing what the index *can't* 
do would be a decent heuristic. The naive comparison in {{SIM}} which evaluates 
the reduced filters by number of expressions could safely be changed later 
without breaking custom index implementations (at least, that was the 
intention). The merits of attempting to future proof vs YAGNI are of course 
debatable, so if we agree this needs more consideration then I'll happily 
revert the selection criteria to simply consider {{estimateResultRows}}.

bq. The initialization of an {{Index}} bothers me a bit

bq. As for registration, index can do it directly in the ctor by calling 
{{baseCfs.indexManager.register()}} (we can also specify they have to do it). 

I'm not a fan of enforcing that kind of thing just by specification. Also, IoC 
is more explicit, cleaner and reduces dependencies. I *would* also argue that 
it improves testability but as I'm not actually testing this anywhere, I won't. 
Either way, I'm fine with requiring a constructor with a specific signature & 
removing the {{init}} method, but I'd prefer to keep {{register}} separate if 
you don't object too strongly.
Aside from that, I agree with all your points regarding 
construction/init/reload etc. On caveat is that {{SIM#reload}} can't just 
reload the indexes it knows about already, it has to check that every index 
defined in its base CFS's metadata is present. In the schema update from 
executing {{CreateIndexStatement}}, only the {{CFMetaData.indexes}} is updated, 
not the {{ColumnFamilyStore}} itself, which is just reloaded. When that reload 
happens then, the {{SIM}} needs to add the new index which is present in 
{{CFM.indexes}}, but not already registered. With that exception, I've made 
those changes in {{454bba7708018f872df825103aa52a99c8f653bd}}.

bq. Not a fan of using {{Optional}} as return type of 
{{Index.getReducedFilter}} as I would have expected intuitively that an empty 
optional would means the whole filter is reduced

As noted above, this whole approach to selection is up for debate so it may be 
that this simply becomes not a problem. That said, I would argue against a 
couple of your points here; I don't agree that an empty optional intuitively 
implies the total reduction of the filter, in that case an empty filter, not an 
empty optional, would seem most semantically correct to me. Secondly, it feels 
fragile to make assumptions about object equality in this way, especially in an 
extension point like this. I would rather not depend on documentation to 
enforce this sort of thing. Prior art for doing this *kind* of thing in C* is 
to return null when there's caller cannot satisfy a request and so using 
{{Optional}} instead seems pretty reasonable to me.

bq. I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd 
have prefer something like {{Index.getUnhandledExpressions}}

Again, the point of the method isn't necessarily to identify which of 
*original* expressions are unhandled, isn't it conceivable that a custom index 
could radically transform a filter into one containing an entirely disjoint set 
of expressions from the original?

bq. In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and 
{{removeRow}} seems dangerous to me..if insertRow() is called with some 
tombstone, it will insert the cells instead of removing them for 
instance{{insertRow}} and {{updateRow}}.

Ok that's a good point, but can you clarify do you mean regular the row may 
contain regular tombstones (i.e. the result of "DELETE col FROM table WHERE 
..."?) If that is the case, then yes there is a bug currently in that we will 
try to insert an index entry for the (empty) value of the tombstone. I've fixed 
that (in {{4d7fa1b83a92f4f5067af5ac1693ebef48135ea0}}, but I don't quite get 
what you mean by "it will insert the cells instead of removing them for 
instance", as the tombstone has no value there's nothing we can remove - in 
{{updateRow}} we can remove the value from the existing row ofc. I also think 
we {{removeRow}} is still needed for cleanup & compaction

bq. Why do we care about {{IndexMetadata.TargetType}}?

The main driver for {{IndexMetadata.TargetType}} was to make explicit the 
difference between indexes with dynamic and static sets of target columns. We 
could infer that an index greedily processes all partition updates if its 
target columns is an empty set, but we discussed this in CASSANDRA-6717 and 
decided that explicitly indicating that was better (at least from a 
{{system_schema}} perspective).

bq. The fact that we have both {{IndexMetadata.TargetType}} and 
{{IndexTarget.TargetType}} is terribly confusing

It's pretty lame, but the best I could come up with was to rename 
{{IndexTarget.TargetType}} to {{IndexTarget.Type}}, it's only used within 
{{IndexTarget}} itself an {{CreateIndexStatement}} so maybe the extra info in 
the name is redundant.

bq. In {{AlterTableStatement.java}}, I'd rather make the {{if 
(index.columns.size() == 1)}} an assertion: we _will_ have to change this code 
once we support index on multiple columns (either by erroring out, or by 
actually dropping the index anyway), so I'd prefer having the code crash loudly 
if we forgot to update it rather than silently doing the wrong thing.

Good call, I almost did this in the first place, fixed.

bq. In {{Keyspace.getValidColumnFamilies}}, it would make more sense to me to 
have a {{SecondaryIndexManager.getIndexByName(String indexName)}} method and 
use that (as a side note, it's weird that the method doesn't actually use the 
'idxName' variable. Do you know what gives?).

Yes, that's something of a legacy from earlier versions before CASSANDRA-6717 
added {{IndexMetadata}} (at that point I'd completely removed the 
{{SIM.indexes}} map and missed here when adding it back. I'm not sure what you 
mean about 'idxName', looks like it's used to me (though reading this method 
was making my head hurt and there was a bug in it, so I've pushed a cleaned up 
version).

bq. Is there a reason for {{IndexRegistry}}? It doesn't hurt, but it doesn't 
seem of any concrete use to me (maybe it's for custom indexes, but even then 
I'm not really sure what it allows concretely).

Mentioned above re:IoC, if we agree that this is a reasonable approach, 
abstracting the registry-ness from {{SIM}} makes it much easier to use a 
lightweight implementation for tests etc (argument slightly undermined by not 
actually doing this anywhere).

bq. Your IDE seems to expand {{.*}} imports.

arrrgh, my settings had been lost and out of force of habit I'd been "fixing" 
the imports (or so I thought) every time I made a change. I'll make sure I do a 
final pass through all the modified files before this gets committed.

I'm nearly done with the points from your second comment, so I'll update again 
shortly

> SecondaryIndex API redesign
> ---------------------------
>
>                 Key: CASSANDRA-9459
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.0 beta 1
>
>
> For some time now the index subsystem has been a pain point and in large part 
> this is due to the way that the APIs and principal classes have grown 
> organically over the years. It would be a good idea to conduct a wholesale 
> review of the area and see if we can come up with something a bit more 
> coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
> could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
> is done in CASSANDRA-8099).
> * SecondayIndexManager is overly complex and several of its functions should 
> be simplified/re-examined. The handling of which columns are indexed and 
> index selection on both the read and write paths are somewhat dense and 
> unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some 
> serious rework.
> There are a number of outstanding tickets which we should be able to roll 
> into this higher level one as subtasks (but I'll defer doing that until 
> getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of 
> both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
> storage engine changes may facilitate a friendlier index API, but also 
> because of the changes to SIS mentioned above. As for 6717, the changes to 
> schema tables there will help facilitate CASSANDRA-7771.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

Reply via email to