[ 
https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032102#comment-14032102
 ] 

Rustam Aliyev commented on CASSANDRA-6477:
------------------------------------------

In addition to performance, one of the key advantages of application-maintained 
global indexes is flexibility. I think it's important to preserve it in 
built-in global indexes. A few cases I think are important to consider:

# Composite index. A global index can be based on more than one column.
# Range query on indexed elements. With a high-cardinality global index, 
allowing range queries on the indexed elements would make consecutive 
multigets efficient. For example, indexing time-series data by type and then 
looking up with 
{{... TYPE="type1" and ID > minTimeuuid('2013-02-02 10:00+0000')}}
# Reverse key index. It should be possible to define the order (ASC, DESC) of 
the index clustering key (i.e. the indexed elements). This is helpful when 
used with the range queries above.
# Function-based index. In this case, the index is defined by a transformation 
function, for example lowercase(value) or an arithmetic function like (field1 * 
field2).
# Storing data in the index. Typically, global indexes have the following 
structure, where the values are nulls:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : null,
       "el_id5" : null,
       ...
   }
}
{code}
However, sometimes it's efficient and convenient to keep some information in 
the values. For example, let's assume that the elements above contain tens of 
fields, but in 90% of cases the application uses only one of them, e.g. a hash. 
In that case, it's efficient to scan the index and retrieve the hash values 
directly from it instead of doing an additional lookup in the original table. 
The table above would then look like:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : "74335a7c9229...",
       "el_id5" : "28b986fa29eb...",
       ...
   }
}
{code}
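
To make the layout above concrete, here is a minimal in-memory Java sketch, not a proposal for the actual storage format. The class name {{GlobalIndexSketch}} and its methods are hypothetical. It assumes element ids are strings, keeps them sorted within each index value (ASC or DESC, as in item 3), allows a null or a stored value per element (item 5), and supports a range scan over element ids (item 2).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model of the "idx_table" layout shown above. */
class GlobalIndexSketch {
    // index value -> (element id -> stored value, or null when nothing is stored)
    private final Map<String, NavigableMap<String, String>> index = new HashMap<>();
    // Clustering order of element ids within one index value.
    private final Comparator<String> idOrder;

    GlobalIndexSketch(boolean descending) {
        this.idOrder = descending
                ? Comparator.<String>reverseOrder()
                : Comparator.<String>naturalOrder();
    }

    /** Adds an element under an index value; storedValue may be null. */
    void put(String indexValue, String elementId, String storedValue) {
        index.computeIfAbsent(indexValue, k -> new TreeMap<>(idOrder))
             .put(elementId, storedValue);
    }

    /** Returns the elements that come strictly after fromId in clustering order. */
    NavigableMap<String, String> scanAfter(String indexValue, String fromId) {
        NavigableMap<String, String> row = index.get(indexValue);
        if (row == null) {
            return new TreeMap<>(idOrder);
        }
        return row.tailMap(fromId, false);
    }
}
```

With ascending order, {{scanAfter("index_value1", "el_id1")}} returns the entries whose ids sort after {{el_id1}}, together with their stored values, so the caller does not need a second lookup in the original table.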

Traditional RDBMSs support most of these indexes. For function-based indexes we 
could create a bunch of functions in CQL3 (e.g. Math.*, LOWERCASE(), etc.), 
similar to other RDBMSs.

Alternatively, we can achieve greater flexibility by storing optional Java 8 
lambda functions. The lambda function would take the mutated row as input and 
return two values:
# a non-empty set of index values (required)
# a map of id -> value, used to look up the values to store in the index 
(optional). If an element is not found, null is stored.

The {{CREATE INDEX}} statement has to define the CQL type of the produced index 
and, optionally, the stored index values:
{code}
CREATE GLOBAL INDEX account_by_email_idx ON accounts ( LAMBDA("row -> { return 
row.email.toLowerCase(); }") ) WITH INDEX_TYPE = {'text'};
{code}

More examples:
# Lowercase email: {code} row -> { return row.email.toLowerCase(); } {code}
# Distance between coordinates: {code} row -> { return 
Math.sqrt((row.x1-row.x2)*(row.x1-row.x2) + (row.y1-row.y2)*(row.y1-row.y2)); } 
{code}
# Conditional index: {code} row -> { return row.price > 0 ? "paid" : "free"; } 
{code}
# Indexes with values (item 5 above) may require a special return type (e.g. 
{{IndexWithValues}}). In the following example, the message length is stored in 
the index: {code} row -> { return new IndexWithValues(row.type, 
row.message.length()); } {code}
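
For reference, the four examples above could be written as ordinary Java 8 lambdas roughly as follows. This is only a sketch under assumptions: a row is modelled as a map of column name -> value (the {{row.email}} shorthand above is not valid Java for such a type), and {{IndexFunctionSketch}}, {{IndexWithValues}} and the constant names are hypothetical, not an existing Cassandra API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of index functions as Java 8 lambdas. */
class IndexFunctionSketch {

    /** Hypothetical return type pairing an index value with a value stored next to it. */
    static final class IndexWithValues {
        final Object indexValue;
        final Object storedValue;

        IndexWithValues(Object indexValue, Object storedValue) {
            this.indexValue = indexValue;
            this.storedValue = storedValue;
        }
    }

    // Assumption: numeric columns are stored as java.lang.Number instances.
    private static double num(Map<String, Object> row, String column) {
        return ((Number) row.get(column)).doubleValue();
    }

    // Lowercase email.
    static final Function<Map<String, Object>, String> LOWERCASE_EMAIL =
            row -> ((String) row.get("email")).toLowerCase(Locale.ROOT);

    // Distance between coordinates (x1, y1) and (x2, y2).
    static final Function<Map<String, Object>, Double> DISTANCE = row -> {
        double dx = num(row, "x1") - num(row, "x2");
        double dy = num(row, "y1") - num(row, "y2");
        return Math.sqrt(dx * dx + dy * dy);
    };

    // Conditional index: "paid" when price > 0, otherwise "free".
    static final Function<Map<String, Object>, String> PAID_OR_FREE =
            row -> num(row, "price") > 0 ? "paid" : "free";

    // Index on type, with the message length stored in the index.
    static final Function<Map<String, Object>, IndexWithValues> TYPE_WITH_MESSAGE_LENGTH =
            row -> new IndexWithValues(row.get("type"), ((String) row.get("message")).length());
}
```

Each function returns a single index value here for brevity; returning a set of index values, as described earlier, would only change the return type.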

Querying these indexes is another open question. Consider the distance between 
coordinates example above: what would the SELECT statement for this index be? 
With application-maintained global indexes, the application can simply look up 
the index using a given value. The same applies to indexes with stored values.
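
As a rough illustration of the application-maintained case, the sketch below keeps a function-based index in memory and looks it up by an already-computed value. The names ({{FunctionIndexLookupSketch}}, {{onWrite}}, {{lookup}}) are hypothetical, rows are again assumed to be maps of column name -> value, and removing stale entries when a row changes is deliberately left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical application-side index keyed by the result of an index function. */
class FunctionIndexLookupSketch<K> {
    private final Function<Map<String, Object>, K> indexFn;
    // computed index value -> ids of the rows that produced it
    private final Map<K, Set<String>> index = new HashMap<>();

    FunctionIndexLookupSketch(Function<Map<String, Object>, K> indexFn) {
        this.indexFn = indexFn;
    }

    /** Write path: compute the index value from the row and record the row id. */
    void onWrite(String rowId, Map<String, Object> row) {
        index.computeIfAbsent(indexFn.apply(row), k -> new TreeSet<>()).add(rowId);
    }

    /** Read path: the caller supplies the index value directly, e.g. a distance. */
    Set<String> lookup(K indexValue) {
        return index.getOrDefault(indexValue, Collections.<String>emptySet());
    }
}
```

The read path takes the computed value itself rather than the source columns, which is exactly the part that has no obvious SELECT equivalent for a built-in index.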

Without these, built-in global indexes will be very limited and, once again, 
application-maintained global indexes will remain the go-to solution.

> Global indexes
> --------------
>
>                 Key: CASSANDRA-6477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 3.0
>
>
> Local indexes are suitable for low-cardinality data, where spreading the 
> index across the cluster is a Good Thing.  However, for high-cardinality 
> data, local indexes require querying most nodes in the cluster even if only a 
> handful of rows is returned.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
