
[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-09-02 Thread Andrew Hust (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727859#comment-14727859
 ] 

Andrew Hust commented on CASSANDRA-9459:


When creating separate indexes on both the keys and the values of a map column, 
the DDL for the table in cqlsh contains only the index on the values. Both 
indexes are functional and queries return the expected results. When querying 
metadata from the Python driver (3.0.0a2), both indexes are present and the 
function as_cql_query produces the correct DDL. This might just be an 
out-of-date Python lib in cqlsh.

Tested on C*: {{66b0e1d7889d0858753c6e364e77d86fe278eee4}}

Can be reproduced with the following shell commands and ccm:
{code}
ccm remove 2i_test
ccm create -n 1 -v git:cassandra-3.0 -s 2i_test
ccm start

cat << EOF | ccm node1 cqlsh
CREATE KEYSPACE index_test_ks WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 1};
USE index_test_ks;
CREATE TABLE tbl1 (id uuid primary key, ds map<text, int>, c1 int);
INSERT INTO tbl1 (id, ds, c1) values (uuid(), {'foo': 1, 'bar': 2}, 1);
INSERT INTO tbl1 (id, ds, c1) values (uuid(), {'faz': 1, 'baz': 2}, 2);
CREATE INDEX ix_tbl1_map_values ON tbl1(ds);
CREATE INDEX ix_tbl1_map_keys ON tbl1(keys(ds));

SELECT * FROM tbl1 where ds contains 1;
SELECT * FROM tbl1 where ds contains key 'foo';

// ***
// DDL only has ix_tbl1_map_values present
// ***
DESC TABLE tbl1;

// ***
// system_schema.indexes is correct
// ***
SELECT * FROM system_schema.indexes;
EOF
ccm stop
{code}

Example output:
{code}
CREATE TABLE index_test_ks.tbl1 (
    id uuid PRIMARY KEY,
    c1 int,
    ds map<text, int>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX ix_tbl1_map_values ON index_test_ks.tbl1 (ds);


 keyspace_name | table_name | index_name         | index_type | options              | target_columns | target_type
---------------+------------+--------------------+------------+----------------------+----------------+-------------
 index_test_ks |       tbl1 |   ix_tbl1_map_keys | COMPOSITES |   {'index_keys': ''} |         {'ds'} |      COLUMN
 index_test_ks |       tbl1 | ix_tbl1_map_values | COMPOSITES | {'index_values': ''} |         {'ds'} |      COLUMN
{code}

> SecondaryIndex API redesign
> ---
>
> Key: CASSANDRA-9459
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Sam Tunnicliffe
>Assignee: Sam Tunnicliffe
> Fix For: 3.0 beta 1
>
>
> For some time now the index subsystem has been a pain point and in large part 
> this is due to the way that the APIs and principal classes have grown 
> organically over the years. It would be a good idea to conduct a wholesale 
> review of the area and see if we can come up with something a bit more 
> coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
> could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
> is done in CASSANDRA-8099).
> * SecondaryIndexManager is overly complex and several of its functions should 
> be simplified/re-examined. The handling of which columns are indexed and 
> index selection on both the read and write paths are somewhat dense and 
> unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some 
> serious rework.
> There are a number of outstanding tickets which we should be able to roll 
> into this higher level one as subtasks (but I'll defer doing that until 
> getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of 
> both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
> storage engine changes may facilitate a friendlier index API, but also 
> because of the changes to SIS mentioned above. As for 6717, the changes to 
> schema tables there will help facilitate CASSANDRA-7771.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-24 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708950#comment-14708950
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


[~benedict] you're right, I actually had it on my todo list for this to revisit 
the (over)use of streams, but it slipped through in the end. I'll clean it up 
in CASSANDRA-10124 (or failing that, open a separate ticket).



[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-23 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708330#comment-14708330
 ] 

Benedict commented on CASSANDRA-9459:
-

A nit question relating to this ticket: In each of the main Transaction 
implementations we seem to be making unnecessary use of streams. These are 
hot code paths on systems using indexes, and I'm not convinced we should be 
incurring the GC burden, especially when clarity isn't meaningfully improved. 
Compare

{code}
try (OpOrder.Group opGroup = Keyspace.writeOrder.start())
{
    Index.Indexer[] indexers = Arrays.stream(indexes)
                                     .map(i -> i.indexerFor(key, nowInSec, opGroup, Type.CLEANUP))
                                     .toArray(Index.Indexer[]::new);

    Arrays.stream(indexers).forEach(Index.Indexer::begin);

    if (partitionDelete != null)
        Arrays.stream(indexers).forEach(indexer -> indexer.partitionDelete(partitionDelete));

    if (row != null)
        Arrays.stream(indexers).forEach(indexer -> indexer.removeRow(row));

    Arrays.stream(indexers).forEach(Index.Indexer::finish);
}
{code}

with

{code}
try (OpOrder.Group opGroup = Keyspace.writeOrder.start())
{
    for (Index index : indexes)
    {
        Index.Indexer indexer = index.indexerFor(key, nowInSec, opGroup, Type.CLEANUP);
        indexer.begin();
        if (partitionDelete != null)
            indexer.partitionDelete(partitionDelete);
        if (row != null)
            indexer.removeRow(row);
        indexer.finish();
    }
}
{code} 

I'm pretty convinced the latter is clearer, and in the former I count at least 
8 unnecessary allocations for the first stream, and 4 for each of the rest. A 
few of these allocations are = 64Kb, and I estimate total allocation per row 
on the order of (but probably a little less than) 1Kb. Conversely, the 
old-style code performs no allocations besides that of the {{Indexer}}.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sylvain Lebresne (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706556#comment-14706556
 ] 

Sylvain Lebresne commented on CASSANDRA-9459:
-

Alright, here are my comments for the rest of the patch review:
* Would be nice to move the calls of {{SecondaryIndexManager.validate}} out of 
{{UpdateParameters}}/{{CassandraServer}} to a single place, like maybe 
{{Keyspace.apply}}, so it's in only one place and we're sure we cannot miss it 
on any path.
* I would remove {{ColumnIndexMetadata}}, inlining the fields directly into 
{{ColumnIndex}} (with proper getters). I think the indirection makes things a 
bit harder to follow (and more verbose) for no benefit I can see (typically we 
store the {{baseCfs}} twice, in both {{ColumnIndex}} and 
{{ColumnIndexMetadata}}. Or {{CassandraIndexSearcher}} holds both the 
{{CassandraIndex}} and {{CassandraIndexMetadata}} even though they are 
essentially the same thing). If you really hate removing it (but I _really_ 
think it would be cleaner to do so), I think we should at least rename it, 
because it currently sounds like it's a specialisation of {{IndexMetadata}} 
while it's not (and more generally it doesn't fit into what we call Metadata in 
general).
* Still not a fan of {{ColumnIndexFunctions}}. I think having subclasses of 
{{CassandraIndex}} would be a lot more idiomatic (since this, to me, fits the 
exact definition of what inheritance is about) and hence simpler (since more 
direct/expected). Would also be a tiny bit less verbose since you won't have to 
basically pass the index to most functions.
* To have {{CassandraIndexSearcher}} be an {{Index.Searcher}} but 
{{ColumnIndexSearcher}} (and its subclasses) not be one is a bit 
surprising/inconsistent naming-wise. And really, having 2 separate classes here 
feels like more complexity than necessary (after all, {{CassandraIndex}} is our 
ColumnIndex implementation so why 2 searcher classes?). So I would just merge 
{{ColumnIndexSearcher}} into {{CassandraIndexSearcher}} (having thus 
{{KeysSearcher}} and {{CompositesSearcher}} be subclasses of 
{{CassandraIndexSearcher}}).
* In {{ColumnFamilyStore.scrubDataDirectories}}, we basically use 
{{!index.isCustom()}} to select indexes that have a backing CFS. Why not use 
SecondaryIndexManager.getAllIndexStorageTables() instead? That's more 
consistent with how we do it in other places (I'll note you've mostly just kept 
the way it was done before, but it could still be nice to clean it up imo).

I've also pushed [a 
commit|https://github.com/pcmanus/cassandra/commits/9459-nits] with a few very 
minor nits that would have taken longer to explain than to do.



[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706745#comment-14706745
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


Thanks [~slebresne], I appreciate it's a bit of a slog. I've pushed fixes that 
address most of your first set of points (too many to enumerate so I'll only 
mention the ones which I didn't just fix, but there's basically a new commit 
per-point in the branch if you want to verify any in particular).

bq. {{getBestIndexFor}} seems to now favor indexes that handle more of the 
expressions. I'm not convinced by that heuristic

The metric isn't so much how many of the original expressions can an Index 
handle, but the amount of filtering left to do on the results of the index 
lookup. It just so happens that all our current (built in) index 
implementations only reduce any filter to the degree of removing 1 expression 
(at most). In its current incarnation {{RowFilter}} is limited to describing 
only a very simple expression tree, one with depth of 1 and containing only AND 
relations. As an effort to future-proof the Index interface somewhat I thought 
transforming the original filter into one representing what the index *can't* 
do would be a decent heuristic. The naive comparison in {{SIM}} which evaluates 
the reduced filters by number of expressions could safely be changed later 
without breaking custom index implementations (at least, that was the 
intention). The merits of attempting to future proof vs YAGNI are of course 
debatable, so if we agree this needs more consideration then I'll happily 
revert the selection criteria to simply consider {{estimateResultRows}}.

bq. The initialization of an {{Index}} bothers me a bit

bq. As for registration, index can do it directly in the ctor by calling 
{{baseCfs.indexManager.register()}} (we can also specify they have to do it). 

I'm not a fan of enforcing that kind of thing just by specification. Also, IoC 
is more explicit, cleaner and reduces dependencies. I *would* also argue that 
it improves testability but as I'm not actually testing this anywhere, I won't. 
Either way, I'm fine with requiring a constructor with a specific signature & 
removing the {{init}} method, but I'd prefer to keep {{register}} separate if 
you don't object too strongly.
Aside from that, I agree with all your points regarding 
construction/init/reload etc. One caveat is that {{SIM#reload}} can't just 
reload the indexes it knows about already, it has to check that every index 
defined in its base CFS's metadata is present. In the schema update from 
executing {{CreateIndexStatement}}, only the {{CFMetaData.indexes}} is updated, 
not the {{ColumnFamilyStore}} itself, which is just reloaded. When that reload 
happens then, the {{SIM}} needs to add the new index which is present in 
{{CFM.indexes}}, but not already registered. With that exception, I've made 
those changes in {{454bba7708018f872df825103aa52a99c8f653bd}}.
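
The reload caveat above can be illustrated with a toy registry. This is only a sketch under stated assumptions: {{IndexReloadSketch}}, {{register}}, {{reload}} and {{reloadCounts}} are hypothetical names invented for the example and do not reflect the real {{SecondaryIndexManager}} code. The point it shows is that reload walks the indexes named in the table metadata rather than the ones already registered, so an index added by a schema update is picked up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a toy registry, not SecondaryIndexManager itself.
final class IndexReloadSketch
{
    // Index name -> number of times the index has been reloaded.
    final Map<String, Integer> reloadCounts = new LinkedHashMap<>();

    void register(String indexName)
    {
        reloadCounts.putIfAbsent(indexName, 0);
    }

    // Walk the metadata, not the registry, so that indexes created since
    // the last reload are registered instead of being silently skipped.
    void reload(Collection<String> indexesInMetadata)
    {
        for (String name : indexesInMetadata)
        {
            if (reloadCounts.containsKey(name))
                reloadCounts.merge(name, 1, Integer::sum); // already known: reload it
            else
                register(name);                            // new in metadata: add it
        }
    }

    public static void main(String[] args)
    {
        IndexReloadSketch manager = new IndexReloadSketch();
        manager.register("ix_tbl1_map_values");
        manager.reload(Arrays.asList("ix_tbl1_map_values", "ix_tbl1_map_keys"));
        System.out.println(manager.reloadCounts);
    }
}
```

Iterating only over the registered indexes would never see {{ix_tbl1_map_keys}} in this example, which is the failure mode the caveat describes.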

bq. Not a fan of using {{Optional}} as return type of 
{{Index.getReducedFilter}} as I would have expected intuitively that an empty 
optional would means the whole filter is reduced

As noted above, this whole approach to selection is up for debate so it may be 
that this simply becomes not a problem. That said, I would argue against a 
couple of your points here; I don't agree that an empty optional intuitively 
implies the total reduction of the filter, in that case an empty filter, not an 
empty optional, would seem most semantically correct to me. Secondly, it feels 
fragile to make assumptions about object equality in this way, especially in an 
extension point like this. I would rather not depend on documentation to 
enforce this sort of thing. Prior art for doing this *kind* of thing in C* is 
to return null when the callee cannot satisfy a request, so using 
{{Optional}} instead seems pretty reasonable to me.
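
To make the distinction between an empty {{Optional}} and an empty filter concrete, here is a minimal sketch. It assumes a filter can be modelled as a plain list of column names and that the index handles exactly one column; {{ReducedFilterSketch}} and {{reducedFilter}} are hypothetical and are not the signature of {{Index.getReducedFilter}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: a filter is modelled as a list of column names.
final class ReducedFilterSketch
{
    // Empty Optional: the index cannot help with this filter at all.
    // Present but empty list: the index handles everything, nothing left to filter.
    static Optional<List<String>> reducedFilter(String indexedColumn, List<String> filterColumns)
    {
        if (!filterColumns.contains(indexedColumn))
            return Optional.empty();

        List<String> remaining = new ArrayList<>(filterColumns);
        remaining.remove(indexedColumn);
        return Optional.of(remaining);
    }

    public static void main(String[] args)
    {
        System.out.println(reducedFilter("ds", Arrays.asList("ds", "c1"))); // index helps, c1 remains
        System.out.println(reducedFilter("ds", Arrays.asList("ds")));       // index handles everything
        System.out.println(reducedFilter("ds", Arrays.asList("c1")));       // index cannot help
    }
}
```

With this shape the caller never has to compare the returned filter with the original one to find out whether the index applied.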

bq. I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd 
have prefer something like {{Index.getUnhandledExpressions}}

Again, the point of the method isn't necessarily to identify which of the 
*original* expressions are unhandled. Isn't it conceivable that a custom index 
could radically transform a filter into one containing an entirely disjoint set 
of expressions from the original?

bq. In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and 
{{removeRow}} seems dangerous to me..if insertRow() is called with some 
tombstone, it will insert the cells instead of removing them for 
instance{{insertRow}} and {{updateRow}}.

Ok that's a good point, but can you clarify: do you mean the row may contain 
regular tombstones (i.e. the result of DELETE col FROM table WHERE ...)? If 
that is the case, then yes, there is a bug currently in that we will try to 
insert an index entry for the (empty) value of the tombstone. I've

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sylvain Lebresne (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706916#comment-14706916
]

Sylvain Lebresne commented on CASSANDRA-9459:
-

bq. According to a previous comment by Sylvain Lebresne, the limit should be
now applied after post reconciliation, but it seems to me the limit is actually
applied twice

You're correct and both cases are unintentional. I added both
{{CountingPartitionIterator}} relatively late in my iteration on CASSANDRA-8099
and didn't realize I was breaking this. I've pushed a [simple
commit|https://github.com/pcmanus/cassandra/commits/9459-nits] to fix both
instances but basically:
* inside {{RangeCommandIterator#sendNextRequests}}, we didn't really care
about enforcing the limit, we just use the {{CountingPartitionIterator}} to
count results for the sake of potentially updating our concurrency factor
estimate. So I've made sure we don't enforce any limit there.
* in {{StorageProxy#getRangeSlice}}, that was kind of a typo, we do want to
switch the calls to the {{CountingPartitionIterator}} and to
{{postReconciliationProcessing}}.
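
The difference between counting results and enforcing a limit can be sketched with a small wrapper iterator. This is an illustration only: {{CountingIteratorSketch}}, {{countOnly}} and {{counted}} are hypothetical names, and the class is not Cassandra's {{CountingPartitionIterator}}. In count-only mode every element passes through and is merely tallied; with a real limit, iteration stops once the limit is reached.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of a counting wrapper; not Cassandra's CountingPartitionIterator.
final class CountingIteratorSketch<T> implements Iterator<T>
{
    private final Iterator<T> delegate;
    private final int limit;   // Integer.MAX_VALUE means "count only, never cut off"
    private int counted;

    CountingIteratorSketch(Iterator<T> delegate, int limit)
    {
        this.delegate = delegate;
        this.limit = limit;
    }

    // Count results (e.g. to feed an estimate) without enforcing any limit.
    static <T> CountingIteratorSketch<T> countOnly(Iterator<T> delegate)
    {
        return new CountingIteratorSketch<>(delegate, Integer.MAX_VALUE);
    }

    @Override
    public boolean hasNext()
    {
        return counted < limit && delegate.hasNext();
    }

    @Override
    public T next()
    {
        if (!hasNext())
            throw new NoSuchElementException();
        counted++;
        return delegate.next();
    }

    int counted()
    {
        return counted;
    }

    public static void main(String[] args)
    {
        CountingIteratorSketch<String> it = countOnly(Arrays.asList("a", "b", "c").iterator());
        while (it.hasNext())
            it.next();
        System.out.println(it.counted());
    }
}
```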


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707001#comment-14707001
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


Ok [~slebresne], the most recent commit 
([afc4ea1|https://github.com/beobal/cassandra/commit/afc4ea1468a44692ef36498e2acf36a12a104bc8]) 
reworks the class hierarchy around {{CassandraIndexer}} along the lines of 
your suggestions. {{CassandraIndex}} is now an abstract class, with concrete 
subclasses representing the various specializations. I've kept the functions 
(renamed to {{CassandraIndexFunctions}}), but these are now fairly minimal. 
They're used where we need to do some index-type-specific thing without having 
an instance of the index around (like creating a CFMetaData for the backing 
store, for example). I've also rolled {{ColumnIndexSearcher}} and 
{{CassandraIndexSearcher}} together, which was eminently sensible, & removed 
{{ColumnIndexMetadata}}.



[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707014#comment-14707014
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


bq.  In {{ColumnFamilyStore.scrubDataDirectories}}, we basically use 
{{!index.isCustom()}} to select index that have a backing CFS. Why not using 
SecondaryIndexManager.getAllIndexStorageTables() instead

Basically because the method is static and so we don't have a {{SIM}}, only a 
{{CFMetaData}} & its {{Indexes}}.




[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706940#comment-14706940
]

Sam Tunnicliffe commented on CASSANDRA-9459:

bq. I wouldn't mind future-proofing per-se, but I'm saying that considering
the number of expressions left by an index before considering the estimate of
the number of rows it returns is just a bad heuristic, now or in the future.

Ok, fair enough. I'll revert {{SIM}} back to selecting purely based on the
estimated result rows for now.

bq. having a tombstone in a row passed to insertRow() would have triggered the
insertion of an index entry,

And that's exactly what was happening, I'll update {{removeCell}} too.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sylvain Lebresne (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706848#comment-14706848
 ] 

Sylvain Lebresne commented on CASSANDRA-9459:
-

bq. The metric isn't so much how many of the original expressions can an Index 
handle, but the amount of filtering left to do on the results of the index 
lookup.

Doesn't really change my point :). My point is that the filtering post-index 
query is relatively cheap, and what we want to favor first and foremost is to 
minimize the number of results returned by the index. If an index A returns 10 
rows with 3 more expressions to check and another index B returns 1000 rows 
with only 1 expression to check, you want to choose A every single time, not B 
(unless of course B is able to return those 1000 rows more than 100x more 
quickly than A takes to return its 10 rows but we have absolutely no metric 
regarding that). I wouldn't mind future-proofing per-se, but I'm saying that 
considering the number of expressions left by an index _before_ considering the 
estimate of the number of rows it returns is just a bad heuristic, now or in 
the future.
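
The ordering argued for above can be sketched as a comparator that looks at the estimated result rows first. Everything here is hypothetical ({{IndexSelectionSketch}}, {{Candidate}}, {{pickBest}}) and is not the {{getBestIndexFor}} implementation; using the number of remaining expressions as a tie-breaker is an assumption of the sketch, not something settled in this thread.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these names are Cassandra classes.
final class IndexSelectionSketch
{
    static final class Candidate
    {
        final String name;
        final long estimatedRows;        // rows the index lookup is expected to return
        final int remainingExpressions;  // expressions left to check after the lookup

        Candidate(String name, long estimatedRows, int remainingExpressions)
        {
            this.name = name;
            this.estimatedRows = estimatedRows;
            this.remainingExpressions = remainingExpressions;
        }
    }

    // Fewest estimated rows wins; remaining expressions only break ties
    // (the tie-breaker is an assumption made for this example).
    static Optional<Candidate> pickBest(List<Candidate> candidates)
    {
        return candidates.stream()
                         .min(Comparator.comparingLong((Candidate c) -> c.estimatedRows)
                                        .thenComparingInt(c -> c.remainingExpressions));
    }

    public static void main(String[] args)
    {
        List<Candidate> candidates = Arrays.asList(new Candidate("A", 10, 3),
                                                   new Candidate("B", 1000, 1));
        System.out.println(pickBest(candidates).get().name);
    }
}
```

With the numbers from the comment (A: 10 rows and 3 expressions, B: 1000 rows and 1 expression) the sketch picks A, whereas ordering by remaining expressions first would pick B.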

bq. if we agree this needs more consideration

Per what's above, I think it does.

bq. I'd prefer to keep register separate if you don't object too strongly.

I don't.

bq. isn't it conceivable that a custom index could radically transform a filter 
into one containing an entirely disjoint set of expressions from the original?

Sure, why not. Still, if the point is that the filter can radically change, 
then Reduced is still a misleading name. What about something like 
{{getPostIndexQueryFilter}}?

Regarding the {{Optional}} return of that method, I still don't like it (I 
wouldn't like a {{null}} return either), because it basically feels we're 
adding a meaning to the method (whether the filter is handled at all by the 
index) which is not really implied by the method name. In other words, I think 
the clean way would be to have a separate boolean {{isHandled()}} method, but I 
understand this would duplicate work and we want to avoid this. So anyway, not 
a big deal, let's stick with the {{Optional}}.
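
To illustrate the double meaning discussed above, here is a rough sketch in which an empty {{Optional}} signals "not handled by this index" and a present value carries the filter left to apply after the index lookup. The method signature and the use of column-name sets as a filter are assumptions made for illustration only; they are not the types used in the patch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

final class PostIndexFilterSketch {
    // Hypothetical: the "filter" is just the set of column names it restricts,
    // and the index supports exactly one column.
    static Optional<Set<String>> getPostIndexQueryFilter(String indexedColumn,
                                                         Set<String> filterColumns) {
        if (!filterColumns.contains(indexedColumn))
            return Optional.empty();                 // the index cannot handle this filter at all

        Set<String> remaining = new LinkedHashSet<>(filterColumns);
        remaining.remove(indexedColumn);             // what still has to be checked post-lookup
        return Optional.of(remaining);
    }

    public static void main(String[] args) {
        Set<String> filter = new LinkedHashSet<>(Arrays.asList("a", "b"));
        System.out.println(getPostIndexQueryFilter("a", filter)); // prints Optional[[b]]
        System.out.println(getPostIndexQueryFilter("z", filter)); // prints Optional.empty
    }
}
```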

bq. I don't quite get what you mean by it will insert the cells instead of 
removing them for instance, as the tombstone has no value there's nothing we 
can remove

Well, internally a tombstone has a value, it's an empty byte buffer :) (so in 
practice, having a tombstone in a row passed to {{insertRow()}} would have 
triggered the insertion of an index entry, albeit a broken entry).

Anyway, I really did just mean that we should skip tombstones as is currently 
done in the {{SecondaryIndexManager.Updater}} implementations but wasn't done 
in the patch. And while I see you've added it to {{indexCell}}, we need it in 
{{removeCell}} too (nitpick: I would add the {{isLive}} test as an "or" with the 
{{cell == null}} test).
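
The guard described above might look roughly like the following. The {{Cell}} class here is a simplified, hypothetical stand-in (the real one lives in the storage engine and its liveness check takes more context), so treat this as a sketch of the shape of the check rather than the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

final class TombstoneSkipSketch {
    // Hypothetical, heavily simplified cell; a tombstone still carries an (empty) value.
    static final class Cell {
        final String value;
        final boolean tombstone;

        Cell(String value, boolean tombstone) {
            this.value = value;
            this.tombstone = tombstone;
        }

        boolean isLive() {
            return !tombstone;
        }
    }

    final List<String> indexed = new ArrayList<>();
    final List<String> removed = new ArrayList<>();

    void indexCell(Cell cell) {
        if (cell == null || !cell.isLive())
            return;                      // never create an index entry for a tombstone
        indexed.add(cell.value);
    }

    void removeCell(Cell cell) {
        if (cell == null || !cell.isLive())
            return;                      // same guard on the removal path
        removed.add(cell.value);
    }

    public static void main(String[] args) {
        TombstoneSkipSketch indexer = new TombstoneSkipSketch();
        indexer.indexCell(new Cell("", true));     // tombstone: skipped
        indexer.indexCell(new Cell("foo", false)); // live: indexed
        indexer.removeCell(new Cell("", true));    // tombstone: skipped
        System.out.println(indexer.indexed + " " + indexer.removed); // prints "[foo] []"
    }
}
```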

bq. I also think {{removeRow}} is still needed for cleanup & compaction

Sure, you're right of course, I brainfarted.

bq. I'm not sure what you mean about 'idxName', looks like it's used to me

Meant that it wasn't used in the equality test, which was 
{{indexer.getIndexName().equals(cfName)}} and not, as I would have expected 
{{indexer.getIndexName().equals(idxName)}}. Anyway, your new version doesn't 
have this problem so I'm good.

bq. abstracting the registry-ness from {{SIM}} makes it much easier to use a 
lightweight implementation for tests

Fair enough. I guess if we keep {{register}} then that makes a bit more sense. 
Wasn't a big deal anyway, it's not illogical at all, I just wanted to be sure I 
wasn't missing something.


 SecondaryIndex API redesign
 ---

 Key: CASSANDRA-9459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
 Project: Cassandra
  Issue Type: Improvement
Reporter: Sam Tunnicliffe
Assignee: Sam Tunnicliffe
 Fix For: 3.0 beta 1


 For some time now the index subsystem has been a pain point and in large part 
 this is due to the way that the APIs and principal classes have grown 
 organically over the years. It would be a good idea to conduct a wholesale 
 review of the area and see if we can come up with something a bit more 
 coherent.
 A few starting points:
 * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
 could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
 is done in CASSANDRA-8099).
 * SecondaryIndexManager is overly complex and several of its functions should 
 be simplified/re-examined. The handling of which columns are indexed and 
 index selection on both the read and write paths are somewhat dense and 
 unintuitive.
 * The SecondaryIndex class hierarchy is rather convoluted and could use some 
 serious rework.
 There are a number of outstanding tickets which we should

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sylvain Lebresne (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707334#comment-14707334
]

Sylvain Lebresne commented on CASSANDRA-9459:
-

bq. Basically because the method is static and so we don't have a {{SIM}}, only
a {{CFMetaData}} & its {{Indexes}}

Of course, sorry for missing that part.

So, at the time of this writing the branch is just missing the exclusion of
tombstones in {{removeCell}}, but as soon as that's in and assuming cassci is
happy too (it would be nice to link the results here for posterity), I'm +1 on the branch.
Thanks for turning around the latest changes so quickly and great work overall.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707074#comment-14707074
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


bq.  I'll revert SIM back to selecting purely based on the estimated result rows

Done in 
[5b756a3|https://github.com/beobal/cassandra/commit/5b756a304b6fd7f8a5e8acc97ac144bfe948486b]

 SecondaryIndex API redesign
 ---

 Key: CASSANDRA-9459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
 Project: Cassandra
  Issue Type: Improvement
Reporter: Sam Tunnicliffe
Assignee: Sam Tunnicliffe
 Fix For: 3.0 beta 1


 For some time now the index subsystem has been a pain point and in large part 
 this is due to the way that the APIs and principal classes have grown 
 organically over the years. It would be a good idea to conduct a wholesale 
 review of the area and see if we can come up with something a bit more 
 coherent.
 A few starting points:
 * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
 could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
 is done in CASSANDRA-8099).
 * SecondaryIndexManager is overly complex and several of its functions should 
 be simplified/re-examined. The handling of which columns are indexed and 
 index selection on both the read and write paths are somewhat dense and 
 unintuitive.
 * The SecondaryIndex class hierarchy is rather convoluted and could use some 
 serious rework.
 There are a number of outstanding tickets which we should be able to roll 
 into this higher level one as subtasks (but I'll defer doing that until 
 getting into the details of the redesign):
 * CASSANDRA-7771
 * CASSANDRA-8103
 * CASSANDRA-9041
 * CASSANDRA-4458
 * CASSANDRA-8505
 Whilst they're not hard dependencies, I propose that this be done on top of 
 both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
 storage engine changes may facilitate a friendlier index API, but also 
 because of the changes to SIS mentioned above. As for 6717, the changes to 
 schema tables there will help facilitate CASSANDRA-7771.




[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-21 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707657#comment-14707657
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


Thanks, committed to 3.0 in {{0626be8667aefdcf50a051471f83da90bbae9bcf}}
3.0 
[testall|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-9459-wip-testall/lastCompletedBuild/testReport/]
  
[dtests|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-9459-wip-dtest/lastCompletedBuild/testReport/]
 
trunk 
[testall|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-9459-wip-trunk/testall/lastCompletedBuild/testReport/]
 
[dtests|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-9459-wip-trunk/dtest/lastCompletedBuild/testReport/]

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-20 Thread Sylvain Lebresne (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705288#comment-14705288
 ] 

Sylvain Lebresne commented on CASSANDRA-9459:
-

I haven't finished reviewing all of the patch, but I'm going to give a first 
batch of remarks/suggestions/review points. I'll note that while there are quite 
a few, they are mostly relatively small: the bulk of the patch looks 
pretty good, so good job [~beobal].

* In {{SecondaryIndexManager}}:
** {{flushIndexesBlocking}} holds a lock on {{baseCfs.getTracker()}} for the 
whole duration of the flush. I don't think that's what we want. I think what we 
want is to hold the lock while we _submit_ the flush, but not for the whole 
time of the flushing.
** {{getBestIndexFor}} seems to now favor indexes that handle more of the 
expressions. I'm not convinced by that heuristic. It's totally possible for an 
index to handle fewer expressions but be a lot more selective. So I think we 
should stick to only considering {{estimateResultRows}} (as we're currently 
doing unless I'm missing something). If nothing else, I'd rather not do that 
kind of change in this refactoring ticket.
** In {{WriteTimeTransaction.onUpdated}}, I don't think we should ignore the 
{{onPrimaryKeyLivenessInfo}} call: we could be updating a TTL on only the 
clustering columns and that should be carried out to the index.
** {{IndexGCTransaction.onRowMerge}} seems to do more work than it should. I 
believe all we want to do during compaction is remove cells that have been 
shadowed by some deletion (since we don't handle those at write time). But the 
code seems to also add any update (I'm saying imo the condition should be {{if 
(original != null && merged == null)}}). 
** In {{indexPartition}}, the static case should be inside the {{try()}}: no 
reason to filter normal rows but not the static one.
** Why do we need {{IndexAccessor}}, since that's created from a {{ReadCommand}} in 
the first place? Can't we just return an {{Index}}, and have the rest of the 
methods of {{IndexAccessor}} be methods of {{Index}} taking a {{ReadCommand}} 
(which they mostly already are anyway)? (would make {{ReadCommand.getIndex()}} 
method actually return an {{Index}}, which is a little bit more consistent).
** Should probably add a {{if (!hasIndexes())}} test on top of 
{{newUpdateTransaction}}: that's a very common case and a very hot path and 
currently even with no index I think we'll still do a bunch of work (including 
allocating an empty array).
** {{CleanupTransaction}} should be split up in 2, since the Cleanup and Compaction 
uses of it don't overlap and that's a bit confusing. I'd create 
3 interfaces: {{UpdateIndexTransaction}}, {{CleanupIndexTransaction}} and 
{{CompactionIndexTransaction}}. I'd also make those top-level interfaces to 
avoid the long {{SecondaryIndexManager}} everywhere (the concrete 
implementations can stay where they are). We could also have a 
{{IndexTransaction}} (that they all extend and have just start() and commit()) 
to put inside the {{TransactionType}} (just because {{IndexTransaction.Type}} 
looks better than {{SecondaryIndexManager.TransactionType}} :)).
** Is there a reason for using the whole {{IndexMetadata}} as map key in the 
{{indexes}} map? It feels that using the index name should be enough (since we 
guarantee it's unique and fixed) and would make lookups a tad faster since 
there is less to hash/compare and might avoid building fake {{IndexMetadata}} 
just for lookup. Certainly feels cleaner to me in principle.
** I'd rename {{getAllIndexStorageTables}} to {{getAllIndexColumnFamilyStore}}: 
not sure it's worth adding the new verbiage StorageTables (note that I hope 
we'll soon rename {{ColumnFamilyStore}} to {{TableStore}} and rename that 
method accordingly, but it's better to rename consistently for now and deal 
with that later imo).
* In {{Index}}:
** The initialization of an {{Index}} bothers me a bit: the fact that there are 
basically 3 calls ({{init()}}, {{setIndexMetadata()}} and then {{register()}}) 
makes it hard to understand what initialization actually does. It also means 
nothing can be final in the implementations even if it kind of should be (at least 
for {{baseCfs}} in {{CassandraIndex}}). I haven't tested it so I might miss 
some detail, but what I would suggest is to pass the base table CFS and 
the initial {{IndexMetadata}} to the ctor (so for custom indexes, we'd specify 
they should have a ctor expecting those) and we'd then just have an 
{{initializationTask()}} that returns what needs to be done initially. As for 
registration, the index can do it directly in the ctor by calling 
{{baseCfs.indexManager.register()}} (we can also specify they have to do it). 
Now, it's true that {{setIndexMetadata}} is also called during CFS reload, but 
that leads me to another point: it's a bit misleading imo that the index can't
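
On the {{flushIndexesBlocking}} remark at the top of the list above, the distinction between holding a lock while _submitting_ flushes and holding it for the whole flush can be sketched as follows. This is only an illustration under assumed names: {{trackerLock}} stands in for {{baseCfs.getTracker()}}, and the flushes are plain {{Runnable}}s rather than real memtable flushes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class FlushLockSketch {
    // Hypothetical stand-in for the object the real code synchronizes on.
    private final Object trackerLock = new Object();

    void flushAllBlocking(ExecutorService executor, List<Runnable> flushes) {
        List<Future<?>> pending = new ArrayList<>();

        // Hold the lock only while the flush tasks are being submitted...
        synchronized (trackerLock) {
            for (Runnable flush : flushes)
                pending.add(executor.submit(flush));
        }

        // ...and wait for them to complete after the lock has been released.
        for (Future<?> future : pending) {
            try {
                future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        AtomicInteger flushed = new AtomicInteger();
        List<Runnable> flushes = new ArrayList<>();
        flushes.add(flushed::incrementAndGet);
        flushes.add(flushed::incrementAndGet);

        new FlushLockSketch().flushAllBlocking(executor, flushes);
        executor.shutdown();
        System.out.println(flushed.get()); // prints 2
    }
}
```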

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-19 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703314#comment-14703314
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


I've pushed some further commits which I think make this ready for another 
round of review (pinging [~slebresne])

Outstanding issues (some of which may be split out to follow up tickets):

* I'd particularly like some feedback on the way that 
{{SIM.WriteTimeTransaction#onUpdated}} and {{Index#updateRow}} handle Memtable 
updates. Rather than pass the existing, update and reconciled versions of the 
row to every registered Index, and have each of them potentially perform a very 
similar set of diff operations, the diffing is done once (in the transaction) 
and only the deltas are passed to the {{Indexes}} - one row containing the 
subset of the existing row that is now gone, another with the subset of the 
merged row that was added in the operation. This feels a bit counter-intuitive 
and writing the javadoc on {{Index#updateRow}} was tough, so I'm a bit 
concerned that this is going to be tricky for implementors to work with (but I 
don't want to duplicate the diffing if we can avoid it).
* Provide a better API for (re)building indexes. The current approach assumes 
that indexes should always be built from the merged view of data in SSTables, 
but this may not always be the case. That said, this is true for most existing 
implementations, and so is optimised to perform only a single pass through the 
data. I don't want to prohibit that optimisation, so some further thought is 
required.
* Lookups in {{SecondaryIndexManager}} could certainly be improved with some 
better datastructures, rather than always resorting to a scan through the 
entire collection of registered indexes
* There is a mismatch between the name of an index as stored in schema and in 
the value returned from Index.getIndexName, which for the builtin index impls 
is the name of the underlying index CFS. This leaks into a number of places, 
notably around (re)building indexes. I've opened CASSANDRA-10127 for this.
* I'm not entirely happy with the way we validate restrictions using 
{{Index.supportsExpression}}. It seems a bit blunt, but I haven't been able to 
come up with anything better yet.
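
The "diff once, pass only the deltas" idea from the first item in the list above can be sketched like this. Rows are modelled as plain maps from column name to non-null value, and {{SketchIndex}}, {{Delta}} and {{onUpdated}} are hypothetical names chosen for illustration; the real {{Index#updateRow}} works on the storage engine's row types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RowDiffSketch {
    // Outcome of comparing the existing row with the merged (reconciled) row.
    static final class Delta {
        final Map<String, String> removed = new TreeMap<>(); // existing cells that are now gone
        final Map<String, String> added = new TreeMap<>();   // merged cells that are new or changed
    }

    // Hypothetical index callback that receives only the two deltas.
    interface SketchIndex {
        void updateRow(Map<String, String> removed, Map<String, String> added);
    }

    static Delta diff(Map<String, String> existing, Map<String, String> merged) {
        Delta delta = new Delta();
        for (Map.Entry<String, String> e : existing.entrySet())
            if (!e.getValue().equals(merged.get(e.getKey())))
                delta.removed.put(e.getKey(), e.getValue());
        for (Map.Entry<String, String> e : merged.entrySet())
            if (!e.getValue().equals(existing.get(e.getKey())))
                delta.added.put(e.getKey(), e.getValue());
        return delta;
    }

    // The diff is computed once in the transaction and shared by every registered index.
    static void onUpdated(Map<String, String> existing,
                          Map<String, String> merged,
                          List<SketchIndex> indexes) {
        Delta delta = diff(existing, merged);
        for (SketchIndex index : indexes)
            index.updateRow(delta.removed, delta.added);
    }

    public static void main(String[] args) {
        Map<String, String> existing = new TreeMap<>();
        existing.put("a", "1");
        existing.put("b", "2");
        Map<String, String> merged = new TreeMap<>();
        merged.put("a", "1");
        merged.put("b", "3");
        merged.put("c", "4");

        List<SketchIndex> indexes = new ArrayList<>();
        indexes.add((removed, added) ->
            System.out.println("removed=" + removed + " added=" + added));
        onUpdated(existing, merged, indexes); // prints "removed={b=2} added={b=3, c=4}"
    }
}
```

The trade-off raised above is visible here: each index sees two partial rows instead of the full before/after picture, which saves repeated diffing but asks more of the implementor.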


I've avoided squashing the [wip 
branch|https://github.com/beobal/cassandra/tree/9459-wip] since people have 
already commented on that, but I have had to rebase it several times so 
although the commit history has been overwritten, it's remained more or less 
semantically consistent.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-19 Thread Sam Tunnicliffe (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703305#comment-14703305
]

Sam Tunnicliffe commented on CASSANDRA-9459:

@sbtourist in response to your comments (sorry for the delay) :

bq. It seems we've lost CASSANDRA-9196.

This was necessary because of the fact that each Index defined in schema was
automatically registered with {{SecondaryIndexManager}}. So even if a
particular custom index would not participate in any indexing or search
activity on a certain node, due to external configuration or whatnot, its mere
presence would mean that whenever new SSTables were loaded we would perform an
expensive, and possibly pointless iteration through them. This shouldn't happen
anymore, as the decision whether to register an index is now the responsibility
of the index itself, so it can make that choice based on whatever criteria are
necessary.
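
A rough sketch of what "the index decides whether to register" could look like. The {{Registry}} and {{CustomIndex}} classes and the {{activeOnThisNode}} flag are assumptions for illustration; how a real custom index reaches that decision (external configuration or otherwise) is up to the implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SelfRegistrationSketch {
    // Hypothetical, minimal registry; only registered indexes take part in
    // indexing, search, or the iteration over newly loaded SSTables.
    static final class Registry {
        final Set<String> registered = new LinkedHashSet<>();

        void register(String indexName) {
            registered.add(indexName);
        }
    }

    // Hypothetical custom index that opts in (or not) based on its own criteria.
    static final class CustomIndex {
        final String name;

        CustomIndex(String name, boolean activeOnThisNode, Registry registry) {
            this.name = name;
            if (activeOnThisNode)
                registry.register(name);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        new CustomIndex("search_idx", false, registry); // defined in schema, inactive on this node
        new CustomIndex("other_idx", true, registry);
        System.out.println(registry.registered); // prints "[other_idx]"
    }
}
```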

bq. It would be useful to distinguish between a cleanup and a compaction at the
Indexer level, as indexes not backed by CFs will probably do nothing during
compaction.

{{SecondaryIndexManager.TransactionType}} now allows impls to distinguish
between {{WRITE_TIME}}, {{COMPACTION}} and {{CLEANUP}} transactions.
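
As an illustration of how an implementation might use the three transaction types named above, the sketch below shows an index that ignores compaction but reacts to write-time and cleanup removals. The enum mirrors the names in the comment; the {{onRowRemoved}} callback and the counter are hypothetical.

```java
final class TransactionTypeSketch {
    // Mirrors the three values mentioned for SecondaryIndexManager.TransactionType.
    enum TransactionType { WRITE_TIME, COMPACTION, CLEANUP }

    // Hypothetical index not backed by a column family.
    static final class ExternalIndex {
        int removals = 0;

        void onRowRemoved(TransactionType type) {
            if (type == TransactionType.COMPACTION)
                return;        // nothing to do for this kind of index during compaction
            removals++;        // WRITE_TIME and CLEANUP removals are applied
        }
    }

    public static void main(String[] args) {
        ExternalIndex index = new ExternalIndex();
        index.onRowRemoved(TransactionType.COMPACTION);
        index.onRowRemoved(TransactionType.CLEANUP);
        index.onRowRemoved(TransactionType.WRITE_TIME);
        System.out.println(index.removals); // prints 2
    }
}
```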

bq. Cells#reconcile doesn't call Indexer#updateCell in case of counters, but
what if a third-party implementation wants to index them?

Indexes are not supported on counter columns directly. That said, the latest
version changes the way updates are collected by {{WriteTimeTransaction}} with
the effect that counter columns will be present in the Rows supplied to
registered indexers.

bq. SIM#indexPartition seems to miss invoking Indexer#finish.

Thanks, good catch.

On the subsequent comment regarding CASSANDRA-8717, I haven't had a chance yet
but I'll dig further into that shortly.

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-11 Thread Sergio Bossa (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681642#comment-14681642
 ] 

Sergio Bossa commented on CASSANDRA-9459:
-

The search path looks good as well, except I think CASSANDRA-8717 is actually 
broken, probably because of CASSANDRA-8099, but I think it's worth discussing 
here.

According to a previous comment by [~slebresne], the limit should be now 
applied *after* post reconciliation, but it _seems_ to me the limit is actually 
applied *twice*, and both times in a way that IMHO breaks the continuous 
range iteration required by Stratio, and generally any top-k implementation:
1) It is applied via {{CountingPartitionIterator}} while sending concurrent 
range requests in {{RangeCommandIterator#sendNextRequests}}: this means to me 
that each range (actually, the concatenation of all concurrently queried ones) 
will limit its returned result set, which prevents a correct implementation of 
top-k queries (unless you can top-k sort on each replica).
2) It is further applied in {{StorageProxy#getRangeSlice}} via another 
{{CountingPartitionIterator}}, which will pass a limited iterator down to the 
{{Index}} post-processor {{BiFunction}}.

That said, my knowledge of CASSANDRA-8099 isn't deep, so I might be missing 
something in my analysis.
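
Why the order of limiting and ranking matters for top-k can be shown with a toy example. This is not Cassandra code: the "ranges" are plain lists of scores, and both helper methods are hypothetical. Truncating the concatenated range results before ranking drops rows that belong in the answer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class TopKLimitSketch {
    // Rank across all ranges first, then keep the k best.
    static List<Integer> topK(List<List<Integer>> ranges, int k) {
        return ranges.stream()
                     .flatMap(List::stream)
                     .sorted(Comparator.reverseOrder())
                     .limit(k)
                     .collect(Collectors.toList());
    }

    // Truncate the concatenated range results to k rows first, then rank what is left.
    static List<Integer> limitThenRank(List<List<Integer>> ranges, int k) {
        return ranges.stream()
                     .flatMap(List::stream)
                     .limit(k)
                     .sorted(Comparator.reverseOrder())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> ranges = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(9, 8));
        System.out.println(topK(ranges, 2));          // prints "[9, 8]"
        System.out.println(limitThenRank(ranges, 2)); // prints "[2, 1]"
    }
}
```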

I'll now proceed with a last round of review and get back with some final 
feedback.

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-11 Thread Sergio Bossa (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682108#comment-14682108
 ] 

Sergio Bossa commented on CASSANDRA-9459:
-

Last review round done, no other relevant remarks regarding the current code. 

One last piece of feedback: I'd also like to be able to query the index by 
something other than just the indexed column(s), either as part of this issue or 
another one, and I'm fine with [~beobal]'s proposed solution of referencing it by name.

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-10 Thread Sergio Bossa (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680829#comment-14680829
 ] 

Sergio Bossa commented on CASSANDRA-9459:
-

[~beobal],

I've made a first review pass, for now focused on the management and write 
paths: things seem good overall and the API is definitely nicer, I just have 
the following remarks so far:
* It seems we've lost CASSANDRA-9196.
* It would be useful to distinguish between a cleanup and a compaction at the 
{{Indexer}} level, as indexes not backed by CFs will probably do nothing 
during compaction.
* {{Cells#reconcile}} doesn't call {{Indexer#updateCell}} in case of counters, 
but what if a third-party implementation wants to index them?
* {{SIM#indexPartition}} seems to miss invoking {{Indexer#finish}}.
* Nit: comments and variable names refer to {{Indexer}} in several ways: 
handler, updater, index, indexer...

I'll hopefully finish reviewing tomorrow and get back with some more feedback.

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-09 Thread Robert Stupp (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14663284#comment-14663284
]

Robert Stupp commented on CASSANDRA-9459:
-

Will the new API also provide a unique ID per 2i and base table? Currently we
have {{UUID cfId}} unique per base table. The only way to identify a 2i (and
distinguish from the base table) is to check whether a {{.}} is in cfName as
done throughout the code using {{.contains(".")}}. (I'm not sure how far this
is addressed in CASSANDRA-9712)

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-09 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679108#comment-14679108
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


No plans for unique ids for 2i tables here, because at the moment I'm just 
focusing on getting the API changes done for 3.0. There are obviously a lot of 
changes to the implementation, but I'd like to defer any further non-essential 
ones to separate tickets if possible.

 SecondaryIndex API redesign
 ---

 Key: CASSANDRA-9459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
 Project: Cassandra
  Issue Type: Improvement
Reporter: Sam Tunnicliffe
Assignee: Sam Tunnicliffe
 Fix For: 3.0 beta 1


 For some time now the index subsystem has been a pain point and in large part 
 this is due to the way that the APIs and principal classes have grown 
 organically over the years. It would be a good idea to conduct a wholesale 
 review of the area and see if we can come up with something a bit more 
 coherent.
 A few starting points:
 * There's a lot in AbstractPerColumnSecondaryIndex  its subclasses which 
 could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
 is done in CASSANDRA-8099).
 * SecondayIndexManager is overly complex and several of its functions should 
 be simplified/re-examined. The handling of which columns are indexed and 
 index selection on both the read and write paths are somewhat dense and 
 unintuitive.
 * The SecondaryIndex class hierarchy is rather convoluted and could use some 
 serious rework.
 There are a number of outstanding tickets which we should be able to roll 
 into this higher level one as subtasks (but I'll defer doing that until 
 getting into the details of the redesign):
 * CASSANDRA-7771
 * CASSANDRA-8103
 * CASSANDRA-9041
 * CASSANDRA-4458
 * CASSANDRA-8505
 Whilst they're not hard dependencies, I propose that this be done on top of 
 both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
 storage engine changes may facilitate a friendlier index API, but also 
 because of the changes to SIS mentioned above. As for 6717, the changes to 
 schema tables there will help facilitate CASSANDRA-7771.




[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Aleksey Yeschenko (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653688#comment-14653688
]

Aleksey Yeschenko commented on CASSANDRA-9459:
--

bq. but I think that there should be a way to add custom query syntax.

That's extremely out of scope of this ticket, sorry.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Andrés de la Peña (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653751#comment-14653751
 ] 

Andrés de la Peña commented on CASSANDRA-9459:
--

It seems perfect to me, much better than the previous approach. 


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Andrés de la Peña (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653718#comment-14653718
]

Andrés de la Peña commented on CASSANDRA-9459:
--

The current link between the row index and a specific column allows custom
implementations to parse the linked column as custom syntax, as DSE Search,
Stratio and Tuplejump do. I wonder how this feature could be preserved if the
link between the row index and a specific column disappears. I did not mean
anything beyond this.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653737#comment-14653737
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


Possibly by allowing the index itself to be referenced in the where clause, so 
in your example above you wouldn't be required to add a dummy {{lucene}} column 
to associate the index with, you could just do something like:
{code}
SELECT * FROM tweets WHERE my_custom_index='{..}' 
{code}

This is just a suggestion btw, so alternative ideas are welcome.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Aleksey Yeschenko (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653901#comment-14653901
]

Aleksey Yeschenko commented on CASSANDRA-9459:
--

Hopefully we will have SASI in C* soon, and that will extend the supported
syntax greatly (OR, AND, NOT, groups). Then 2i implementors like Stratio,
Tuplejump, and DSE Search will just be able to reuse that syntax.

It wouldn't cover your original request exactly. Just wanted to point out the
probable near-future changes that affect the ticket.

Maybe some special built-in function is what we need to cover your case.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-08-04 Thread Jonathan Ellis (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653870#comment-14653870
]

Jonathan Ellis commented on CASSANDRA-9459:
---

First reaction: I'd rather use some kind of function call syntax so that it's
distinct from normal columns.

Second reaction: Not sure conflating with UDF is much better. Maybe need to
think on this some more.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-07-27 Thread Andrés de la Peña (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642709#comment-14642709
 ] 

Andrés de la Peña commented on CASSANDRA-9459:
--

[~slebresne], [~beobal], I'm not yet familiar with the 3.0 changes but, as far 
as I understand, the iterator passed to {{postReconciliationProcessing}} 
allows it to read {{n}} rows from each of the ranges involved. Thus, it's a 
much cleaner way to implement the top-k feature. However, I have doubts about 
the concurrency factor, which depends on {{estimateResultsPerRange}}. Given 
that top-k queries always need to scan all ranges, I think it would be better 
to fix it to the number of ranges, unless I am missing something.
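
To make the top-k idea concrete, here is a minimal, language-neutral sketch in Python rather than Cassandra code. It assumes that each range yields its rows best-first, and the function name, the {{score}} callback and the sample data are all hypothetical. Reading at most {{k}} rows from every range is enough to compute the global top {{k}}, which is also why every range has to be visited:

```python
import heapq
from itertools import islice


def top_k_across_ranges(range_iterators, k, score):
    """Read at most k rows from every range, then keep the k best overall."""
    candidates = []
    for rows in range_iterators:
        # A single range can contribute at most k rows to the global top k,
        # provided its rows arrive best-first (an assumption of this sketch).
        candidates.extend(islice(rows, k))
    return heapq.nlargest(k, candidates, key=score)


# Hypothetical data: three ranges, rows are (key, score) pairs,
# already sorted best-first within each range.
ranges = [
    iter([("a", 9), ("b", 4)]),
    iter([("c", 7), ("d", 6), ("e", 1)]),
    iter([("f", 8)]),
]
top = top_k_across_ranges(ranges, k=2, score=lambda row: row[1])
```

Note that the third row of the second range is never read, while the single row of the last range ends up in the result.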

[~beobal], the new API looks great. I especially like the method 
{{updateRow(Row oldRow, Row newRow)}}. True per-row indexes, not linked to any 
specific column, are a big win. Adding support for more operators like OR is a 
good idea, but I think that there should also be a way to add custom query 
syntax. Currently we are using column-linked queries such as:
{code:sql}
SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/1", pattern:"yyyy/MM/dd"},
                 {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1, max_expansions:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}' limit 100;
{code}
I'm wondering how it can be done with the new approach.

Another interesting idea, which may or may not already be addressed in 3.0, is 
to support paging over indexes that return results in an order different from 
the one defined by the partitioner and the column name. In Cassandra 2.x it's 
problematic because the last row key is used as the start of the next page's 
{{DataRange}}, whereas it would be preferable to have a {{DataRange}} containing 
both the original key range requested by the user and the last key of the last 
page. Currently we are addressing it with a custom, ugly {{QueryHandler}}, but 
it would be nice to have more generic support for this, unless it already 
exists.
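
As a rough illustration of the paging state described above, the following Python sketch keeps the original key range and the last key of the previous page side by side. It is a toy model, not Cassandra code: {{PageState}}, {{next_page}} and the integer keys are hypothetical names and data, and rows are assumed to arrive in index order (here, by descending score) rather than key order.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PageState:
    key_range: Tuple[int, int]      # original range requested by the user: [start, end)
    last_key: Optional[int] = None  # last key returned by the previous page


def next_page(ordered_rows, state, page_size):
    """Return the next page of (key, score) rows and the state for the page after it."""
    lo, hi = state.key_range
    # The original range is always re-applied; it is never replaced by last_key.
    in_range = [row for row in ordered_rows if lo <= row[0] < hi]
    start = 0
    if state.last_key is not None:
        keys = [row[0] for row in in_range]
        start = keys.index(state.last_key) + 1
    page = in_range[start:start + page_size]
    last_key = page[-1][0] if page else state.last_key
    return page, PageState(state.key_range, last_key)


# Rows in index order (descending score), which differs from key order.
rows = [(5, 0.9), (1, 0.8), (9, 0.7), (3, 0.6), (7, 0.5)]
page1, state1 = next_page(rows, PageState((0, 8)), 2)
page2, state2 = next_page(rows, state1, 2)
```

If the last key (1) simply became the start of the next range, keys 3 and 7 would still match but the position within the index order would be lost; carrying both pieces of information avoids that.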


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-07-27 Thread Sam Tunnicliffe (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642412#comment-14642412
]

Sam Tunnicliffe commented on CASSANDRA-9459:

bq. I think we're good, at least it's the intention.

Great, thanks for clearing that up [~slebresne].


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-07-26 Thread Sylvain Lebresne (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641921#comment-14641921
 ] 

Sylvain Lebresne commented on CASSANDRA-9459:
-

bq. I believe CASSANDRA-8717 is/was broken by CASSANDRA-8099. The post 
reconciliation processing step is still there, but it looks like the code for 
scanning all ranges was removed from StorageProxy.

I think we're good, at least it's the intention. The scan-all-ranges option 
pre-CASSANDRA-8099 is just an ugly way to ask the code not to respect the user 
limit before the post-reconciliation function is called, since the limit is 
the only thing that makes us stop scanning all ranges. However, 
post-CASSANDRA-8099, the user limit is enforced _after_ the post-reconciliation 
call. So an implementation that wants to use CASSANDRA-8717 can consume as much 
of the iterator passed to the post-reconciliation function as it wants/needs, 
and in particular it will get all ranges if it consumes it all. In other words, 
we now support CASSANDRA-8717 with just the post-reconciliation function, but 
that's a feature since it's cleaner.
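
The ordering described above (post-reconciliation step first, user limit afterwards) can be sketched with lazy iterators. This is a Python toy model, not the StorageProxy code; {{scan_ranges}}, {{read}} and the sample ranges are hypothetical. Because the limit wraps the output of the post-processing step, a step that consumes its whole input forces every range to be scanned, while a pass-through step lets the limit stop the scan early:

```python
from itertools import islice


def scan_ranges(ranges, scanned):
    """Lazily yield rows range by range, recording which ranges were read."""
    for name, rows in ranges:
        scanned.append(name)
        yield from rows


def read(ranges, post_process, limit, scanned):
    rows = scan_ranges(ranges, scanned)
    # The user limit is applied to the post-processed iterator, not before it.
    return list(islice(post_process(rows), limit))


ranges = [("r1", [3, 1]), ("r2", [9, 2]), ("r3", [7])]

# A pass-through step: the limit stops the scan after the first range.
scanned_lazy = []
first_two = read(ranges, lambda rows: rows, 2, scanned_lazy)

# A top-k style step sorts everything, so it consumes all ranges first.
scanned_all = []
best_two = read(ranges, lambda rows: iter(sorted(rows, reverse=True)), 2, scanned_all)
```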


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-07-25 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641503#comment-14641503
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:



I've pushed a branch [here|https://github.com/beobal/cassandra/tree/9459-wip] 
with some of the proposed API changes for this ticket. 
This is a fairly large patch, so I'll try to summarise the main changes below, 
but the key places to look at are in the {{org.apache.cassandra.index}} 
package, in particular:

* {{o.a.c.index.Index}}
* {{o.a.c.index.SecondaryIndexManager}}
* {{o.a.c.index.internal.CassandraIndexer}}

This patch is most definitely a work in progress, but I'd appreciate some 
feedback, especially on the general approach and high level API changes. 
[~sbtourist], [~adelapena] and [~xedin] in particular, I know you are likely 
to have opinions on this, which would be good to hear.


h3. Flattened class hierarchy

Instead of:

{noformat}
                      SecondaryIndex
             ________________|________________
            |                                 |
  PerRowSecondaryIndex             PerColumnSecondaryIndex
                                              |
                            AbstractSimplePerColumnSecondaryIndex
                                   ___________|___________
                                  |                       |
                              KeysIndex            CompositesIndex
                                       ___________________|___________________
                                      |                   |                   |
                             CompositesIndexOnX  CompositesIndexOnY  CompositesIndexOnZ
{noformat}

We just have a single {{Index}} interface, with two inner interfaces, 
{{Indexer}} and {{Searcher}}.
The specific differences between indexes on different types of columns in 
composite tables (i.e. all the {{CompositesIndexOnX}} implementations) have 
been abstracted into a set of stateless functions, defined in the 
{{ColumnIndexFunctions}} interface, with implementations for use with the 
various column types. As such, there is now just a single {{Index}} 
implementation for all built-in indexes, {{CassandraIndex}} (I'm not sold on 
this name, but it follows the precedent set by {{CassandraAuthorizer}} and 
{{CassandraRoleManager}}). 
A nice side effect is that {{KEYS}} indexes (for thrift/compact tables and, in 
CASSANDRA-8103, static column indexes) also fit into this pattern, so there is 
no need for another specialisation there. There are still separate searcher 
implementations for {{KEYS}} and {{COMPOSITES}} indexes, but there's a lot more 
commonality between them now (not as a result of this patch; that's an artifact 
of CASSANDRA-8099).
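
The general pattern (one index implementation parameterised by a set of stateless functions, instead of one subclass per column type) can be sketched as follows. This is an illustrative Python model, not the code in the branch: only the name {{ColumnIndexFunctions}} comes from the description above, while {{SimpleIndex}}, {{indexed_values}} and the three function sets are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ColumnIndexFunctions:
    """Stateless helpers capturing what differs between kinds of indexed column."""
    indexed_values: Callable[[object], Iterable[object]]


# One function set per kind of indexed column, instead of one subclass each.
REGULAR = ColumnIndexFunctions(indexed_values=lambda cell: [cell])
MAP_KEYS = ColumnIndexFunctions(indexed_values=lambda cell: list(cell.keys()))
MAP_VALUES = ColumnIndexFunctions(indexed_values=lambda cell: list(cell.values()))


class SimpleIndex:
    """A single index implementation; behaviour varies only via its function set."""

    def __init__(self, column, functions):
        self.column = column
        self.functions = functions
        self.entries = {}  # indexed value -> set of row ids

    def insert(self, row_id, row):
        for value in self.functions.indexed_values(row[self.column]):
            self.entries.setdefault(value, set()).add(row_id)

    def search(self, value):
        return self.entries.get(value, set())


keys_index = SimpleIndex("ds", MAP_KEYS)
values_index = SimpleIndex("ds", MAP_VALUES)
row = {"ds": {"foo": 1, "bar": 2}, "c1": 1}
keys_index.insert(1, row)
values_index.insert(1, row)
```

Here {{keys_index.search("foo")}} and {{values_index.search(2)}} both find row 1, using the same class with different function sets.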

h3. Event driven, partition scoped updates

Instead of delivering updates to an index implementation per-partition (as 
previously with PRSI) or per-cell (PCSI), the write component of the index API 
is more closely aligned to a partition update of the underlying base data.

More specifically, when a partition is updated (either via a regular write, or 
during compaction) a series of events are (or may be) fired. An {{Index}} 
implementation is required to provide an event listener, whose interface is 
defined in {{Index.Indexer}}, to handle these events. The granularity of these 
events maps to a PartitionUpdate, so there are events that are fired on 
* partition delete
* range tombstone
* row inserted
* row updated
* row removed 
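
A minimal sketch of this event-driven shape, in Python rather than the actual {{Index.Indexer}} interface: the listener and method names below are illustrative, and only the row-level events (inserted, updated, removed) are modelled, leaving out partition deletes and range tombstones. The partition is represented as a plain dict from clustering key to row, with {{None}} in the update standing for a row removal.

```python
class RecordingIndexer:
    """Listener that records the events it receives (method names are illustrative)."""

    def __init__(self):
        self.events = []

    def insert_row(self, row):
        self.events.append(("insert", row))

    def update_row(self, old_row, new_row):
        self.events.append(("update", old_row, new_row))

    def remove_row(self, row):
        self.events.append(("remove", row))


def apply_update(existing, update, indexer):
    """Apply a partition-scoped update and fire one event per affected row."""
    for clustering, new_row in update.items():
        old_row = existing.get(clustering)
        if new_row is None:
            # None marks a row removal in this toy model.
            if old_row is not None:
                indexer.remove_row(old_row)
                del existing[clustering]
        elif old_row is None:
            indexer.insert_row(new_row)
            existing[clustering] = new_row
        else:
            indexer.update_row(old_row, new_row)
            existing[clustering] = new_row


partition = {"c1": {"v": 1}, "c2": {"v": 2}}
indexer = RecordingIndexer()
apply_update(partition, {"c1": {"v": 10}, "c2": None, "c3": {"v": 3}}, indexer)
```

The update event carries both the old and the new row, so the listener can work out which index entries became stale without re-reading the base data.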

h3. Caveats/Missing/TBD/etc

* A major thing missing in this branch is CASSANDRA-7771 (multiple indexes 
per-column). Along with that, the plan is also to introduce true per-row 
indexes, where the index is not necessarily linked to *any* specific column. So 
until we start hashing that out a bit better, the way SIM represents the 
collection of Indexes is tbd.
* Related to that, once we've settled on how to define an Index's relationship 
with a Row (moving that out of ColumnDefinition), we can revisit caching and 
lookup optimisation in SIM. Right now, every time we look up an index we 
filter all the registered indexes for the table. We can definitely improve 
this and will do so ASAP.
* The mechanism by which we select indexes at query time remains pretty 
restrictive. The query clauses being represented as a list of 
{{RowFilter.Expression}} means only AND conjunctions are supported. This limits 
the scope for query optimisation and makes it difficult to extend search 
capabilities in the future, like adding support for OR for example. I'd like to 
move to something more expressive to give us scope to improve this area in 
future tickets.
* The validation methods on Index need some work. Basically these were simply 
copied from the existing implementation, but they ought to be reworked to 
combine them into a single {{validate(partition_update)}} or at least into 
{{validate(partitionkey)}} and {{validate(row)}}.
* The index transaction classes in

[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-06-29 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605831#comment-14605831
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


At the moment, this is looking eminently doable on top of CASSANDRA-8099. In my 
WIP I have CASSANDRA-9041 working and CASSANDRA-7771 as close to working as 
possible without CASSANDRA-6717's changes to the underlying schema tables. In 
addition, I've reworked the main 2i API to make it primarily (CQL) row based, 
which should be a better fit for most of the known custom 2i implementations 
out there. 

Right now, the read path and both write paths (write time and compaction) are 
basically working and I'm just troubleshooting some existing searcher issues on the main 
8099 branch. Once I'm done with that I'll post a summary of the proposed new 
API for review while I get on with building out the ancillary parts (rebuild 
and so forth) and improving test coverage.

As far as being able to utilise CQL internally in 2i implementations, it's not 
something I've looked at yet but I'm working on dummy index implementations to 
help validate the API, so I can use those to investigate.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-06-16 Thread Bryn Cooke (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588157#comment-14588157
]

Bryn Cooke commented on CASSANDRA-9459:
---

It would be great if we could use CQL inside our custom secondary index
implementations, as it would make our code far more readable than generating
slice queries manually. It's almost possible in Cassandra 2.x, but I forget the
exact reason it didn't work. Something about the table metadata not being
available during CQL validation.


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-05-22 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556335#comment-14556335
 ] 

Jonathan Ellis commented on CASSANDRA-9459:
---

bq. I propose that this be done on top of both CASSANDRA-8099 and CASSANDRA-6717

(As long as you mean the newly scope-limited 6717 and not everything pulled 
into CASSANDRA-9424.)


[jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign

2015-05-22 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556346#comment-14556346
 ] 

Sam Tunnicliffe commented on CASSANDRA-9459:


bq. (As long as you mean the newly scope-limited 6717 and not everything pulled 
into CASSANDRA-9424.)

That is exactly what I mean

