[ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079735#comment-13079735 ]
Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. like getLuceneAnalyzer()

There won't always be a one-to-one mapping of a column to a field. For example, Solr has copy field, which essentially creates a new field. Also, Analyzer is for any field; the right per-field class would be Tokenizer.

I strongly believe we need to have an interface that accepts a row and essentially generates a Lucene Document. This should be the most straightforward approach that enables just about anything, including using a Solr schema at some point.

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>              Labels: secondary_index
>             Fix For: 1.0
>
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their
> current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the
> highest clause
>    - One new column family is created per index; this means 10 new CFs for 10
> secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one
> index per CF, and utilize the Lucene query engine to handle multiple index
> clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory then flushes them to disk so we can sync
> our memtable flushes to Lucene flushes. Lucene also has optimize() which
> correlates to our compaction process, so these can be sync'd as well.
> We will also need to correlate column validators to Lucene tokenizers, so the
> data can be stored properly. The big win is that once this is done we can perform
> complex queries within a column like wildcard searches.
> The downside of this approach is we will need to read before write since
> documents in Lucene are written as complete documents. For random workloads
> with lots of indexed columns this means we need to read the document from
> the index, update it and write it back.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
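
To make the interface proposed in the comment more concrete, here is a minimal sketch of a hook that accepts a row and returns a whole document, so one column can produce more than one field. Everything in it is an assumption for illustration: `RowDocumentMapper`, `CopyFieldMapper`, `IndexDocument` and `IndexField` are hypothetical names, and the last two are dependency-free stand-ins for Lucene's Document and Field rather than the real classes. It is not the API from any patch on this ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for a Lucene field: a name plus the text to index (hypothetical).
final class IndexField {
    final String name;
    final String value;

    IndexField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Stand-in for a Lucene Document: an ordered list of fields (hypothetical).
final class IndexDocument {
    final List<IndexField> fields = new ArrayList<>();

    void add(String name, String value) {
        fields.add(new IndexField(name, value));
    }

    // Returns every value stored under the given field name.
    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (IndexField f : fields) {
            if (f.name.equals(name)) {
                out.add(f.value);
            }
        }
        return out;
    }
}

// Hypothetical extension point: called once per row, returns the full document.
interface RowDocumentMapper {
    IndexDocument toDocument(String rowKey, Map<String, String> columns);
}

// Example mapper: indexes each column under its own name and also copies
// selected columns into a catch-all "text" field, similar in spirit to a
// Solr copy field. The column-to-field mapping is therefore not one-to-one.
final class CopyFieldMapper implements RowDocumentMapper {
    private final List<String> copiedColumns;

    CopyFieldMapper(List<String> copiedColumns) {
        this.copiedColumns = copiedColumns;
    }

    @Override
    public IndexDocument toDocument(String rowKey, Map<String, String> columns) {
        IndexDocument doc = new IndexDocument();
        doc.add("key", rowKey);
        for (Map.Entry<String, String> column : columns.entrySet()) {
            doc.add(column.getKey(), column.getValue());
            if (copiedColumns.contains(column.getKey())) {
                doc.add("text", column.getValue());
            }
        }
        return doc;
    }
}
```

With a row holding `title` and `author` columns and `title` configured as a copied column, the mapper emits four fields: `key`, `title`, `text` and `author`. A per-column hook such as getLuceneAnalyzer() could not express the extra `text` field, which is the case the comment raises.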