[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Stu Hood (JIRA) Sat, 27 Mar 2010 01:38:52 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850492#action_12850492
 ]


Stu Hood commented on CASSANDRA-749:
------------------------------------

> This isn't possible with the "pretend index is a supercolumn row" approach.
I'm not sure that I understand why... can you give an example? The key in the 
pseudo CF would be the original indexed value, and each top level column in the 
index row would be a row from the base (from one node), so filtering within the 
base row could be applied locally on each node.

> multiget(rowpredicate, columnpredicate)* 
The rowpredicate containing an "index scan" parameter is very interesting, and 
does clarify which operations are slow. But I can easily imagine a situation 
where someone wants to use both a "named keys" and an "index scan" rowpredicate 
at once, which would still be very efficient, but would require a 
list<rowpredicate>.

I agree that placing the "index scan" predicate in the first position in the 
method call is essential, which is why I suggested the pseudo-CF api.

----

An interesting parallel is to compare the proposed api to Python's array 
slicing syntax, which is extremely elegant. I imagine that our ideal API is one 
that allows either named keys or a key range at every level of nesting. The 
following paragraphs only refer to key/name slicing, and don't go into 'value' 
queries.

As long as you concretely define a key or range of keys to search for at each 
level (such as [key1:key5][name1:name2][subname5]), your operation can run in 
bounded time. But, to provide more flexibility, the get_range_slices method 
in the current API allows something like: [ ? ][name5]. The question mark 
represents an unbounded level, which may mean a full table scan without finding 
'subname5' (very dangerous, not scalable). This is one of the places where we 
need secondary indexes: we want columns containing _any_ value for subname5 
bunched together into an index.
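
To make the cost difference concrete, here is a minimal Python sketch. It is 
not Cassandra code: the nested dict layout, the sample rows, and the helper 
names full_scan and build_index are all assumptions made for illustration. It 
contrasts an unbounded first level (every row is visited) with a lookup in an 
index that bunches together the columns containing a given subname.

```python
from collections import defaultdict

# Hypothetical data: row key -> column name -> subcolumn name -> value.
rows = {
    "key1": {"name1": {"subname5": "a"}},
    "key2": {"name1": {"subname7": "b"}},
    "key3": {"name2": {"subname5": "c"}},
}


def full_scan(rows, subname):
    """Unbounded first level: every row and column has to be visited."""
    visited = 0
    hits = []
    for key, columns in rows.items():
        for name, subcolumns in columns.items():
            visited += 1
            if subname in subcolumns:
                hits.append((key, name))
    return hits, visited


def build_index(rows):
    """Bunch (key, name) pairs together under each subname they contain."""
    index = defaultdict(list)
    for key, columns in rows.items():
        for name, subcolumns in columns.items():
            for subname in subcolumns:
                index[subname].append((key, name))
    return index


scan_hits, visited = full_scan(rows, "subname5")
index = build_index(rows)
index_hits = index["subname5"]  # one lookup, no scan over unrelated rows
```

Both paths return the same pairs, but the scan touches every column in the 
table while the index lookup touches only the matching entries.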

Comparing to the Python array API highlights the fact that prefix searches are 
always safe, and that by always having a parent predicate, you achieve bounded 
time operations. This is why placing the "index scan" predicate in the first 
position is so clear.
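
As a rough illustration of a bounded prefix query such as 
[key1:key5][name1:name2][subname5], the sketch below slices each level with a 
binary search over sorted keys. The data, the helper names key_range and 
prefix_slice, and the choice of inclusive range ends are assumptions of this 
example (Python's own slices exclude the end), not part of any proposed API.

```python
from bisect import bisect_left, bisect_right


def key_range(sorted_keys, start, end):
    """Return the keys in [start, end] (inclusive) using binary search."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


# Hypothetical data: row key -> column name -> subcolumn name -> value.
data = {
    "key1": {"name1": {"subname5": "a"}, "name3": {"subname5": "b"}},
    "key3": {"name2": {"subname5": "c", "subname6": "d"}},
    "key9": {"name1": {"subname5": "e"}},
}


def prefix_slice(data, key_start, key_end, name_start, name_end, subname):
    """Every level has a concrete predicate, so work is bounded by the ranges."""
    results = []
    # sorted() is called here for brevity; a real store would keep keys sorted.
    for key in key_range(sorted(data), key_start, key_end):
        row = data[key]
        for name in key_range(sorted(row), name_start, name_end):
            subcolumns = row[name]
            if subname in subcolumns:
                results.append((key, name, subname, subcolumns[subname]))
    return results


hits = prefix_slice(data, "key1", "key5", "name1", "name2", "subname5")
```

Rows outside key1..key5 and columns outside name1..name2 are never 
inspected, which is the property the parent predicate buys.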

----

This brings us back to the pseudo-CF api: why have 3 types of rowpredicates, 
and 2+ types of columnpredicates when, by asking users to define views that 
shuffle their data into a form that allows for prefix queries, we can do 
something like:

multiget(list<predicate> predicates)

... with a predicate (key range or key list) required for every level, and only 
the last level allowing an unbounded predicate.

With this API, the "named keys" + "index scan" query I pointed out above would 
look like (with an indexed 'age' column):

multiget( [ predicate(key is 27), predicate(name in [ben, george]), 
predicate(subname is any) ] )
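
A minimal Python sketch of how such a call could behave, under stated 
assumptions: the Predicate class, its fields, the in-memory view dict (a 
pseudo-CF keyed by the indexed 'age' value), and the sample data are all 
hypothetical and only illustrate the rule that every level except the last 
needs a bounded predicate.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: a key list, a key range, or nothing (any)."""
    keys: Optional[Sequence] = None
    start: Any = None
    end: Any = None

    def is_unbounded(self):
        return self.keys is None and self.start is None and self.end is None

    def select(self, node):
        """Return the keys of node that this predicate matches."""
        if self.keys is not None:
            return [k for k in self.keys if k in node]
        return [
            k for k in sorted(node)
            if (self.start is None or k >= self.start)
            and (self.end is None or k <= self.end)
        ]


def multiget(view, predicates):
    """Walk the nested view one level per predicate."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    for pred in predicates[:-1]:
        if pred.is_unbounded():
            raise ValueError("only the last level may be unbounded")

    def walk(node, level, path):
        for key in predicates[level].select(node):
            if level == len(predicates) - 1:
                yield path + (key,), node[key]
            else:
                yield from walk(node[key], level + 1, path + (key,))

    return list(walk(view, 0, ()))


# Hypothetical view: indexed age -> base row key -> column name -> value.
view = {
    27: {
        "alice": {"email": "alice@example.com"},
        "ben": {"email": "ben@example.com"},
        "george": {"city": "Austin", "email": "george@example.com"},
    },
    31: {"ben": {"email": "other.ben@example.com"}},
}

result = multiget(view, [
    Predicate(keys=[27]),               # key is 27
    Predicate(keys=["ben", "george"]),  # name in [ben, george]
    Predicate(),                        # subname is any
])
```

Passing an empty Predicate() at any level other than the last raises a 
ValueError in this sketch, which mirrors the "only the last level may be 
unbounded" restriction.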

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, 
> views-discussion-2.txt, views-discussion.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
