On Fri, May 16, 2014 at 10:53 AM, Kevin Burton <bur...@spinn3r.com> wrote:
> I'm struggling with cassandra secondary indexes since the documentation
> seems all over the place and I'm having to put together everything from
> blog posts.
>

This mostly-complete summary content will eventually make it into a blog
post :

"
Secondary Indexes in Cassandra
------------------------------------------
Users frequently come into #cassandra or the cassandra-user@ mailing list
and ask questions about Secondary Indexes. Here is my stock answer.

“Unless you REALLY NEED the feature of atomic update of the secondary index
with the underlying row, you are almost always better off just making your
own manual secondary index column family.”

In Cassandra, the unit of distribution is the partition (f/k/a “Row”). If
your query needs to scan multiple partitions and inspect each of their
contents, you have probably made a mistake in your data model. For queries
which interact with sets of partitions one should use executeAsync() w/ the
new CQL drivers, not multigets.

Advantages of Secondary Indexes :

- Atomic update of secondary index with underlying partition/storage row.
- Don’t have to be maintained manually, including automated rebuild.
- Provides the illusion that you are using a RDBMS.

Disadvantages of Secondary Indexes :

- Before 1.2, they do a read-before-write.
  https://issues.apache.org/jira/browse/CASSANDRA-2897
- A steady trickle of occasionally-serious bugs which do not affect the
  normal read/write path. [3]
- Bad for low cardinality cases. FIXME : detail (relates to checking each
  node)
- Bad for high cardinality cases. FIXME : detail (certain cases? what about
  equality/non-equality?)
- CFstats not exposed via nodetool cfstats before 1.2 :
  https://issues.apache.org/jira/browse/CASSANDRA-4464 ?
- Lower availability than normal Cassandra read path. FIXME : citation
- Unsorted results, in token order and not query value order.
- Can only search on datatypes Cassandra understands.
- Secondary index is located in the same directory as the primary SSTables.
- Provides the illusion that you are using a RDBMS.
"

Readers will note that I am not very clear above on which cardinality cases
they *are* good for, because I consider all the other problems sufficient to
never use them.

=Rob

[1] Citations :
https://issues.apache.org/jira/browse/CASSANDRA-5502
https://issues.apache.org/jira/browse/CASSANDRA-5975
https://issues.apache.org/jira/browse/CASSANDRA-2897 - 2i without read-before-write
https://issues.apache.org/jira/browse/CASSANDRA-1571 - (0.7) Secondary Indexes aren't updated when removing whole row
https://issues.apache.org/jira/browse/CASSANDRA-1747 - (0.7) Truncate is not secondary index aware
https://issues.apache.org/jira/browse/CASSANDRA-1813 - (0.7) return invalidrequest when client attempts to create secondary index on supercolumns
https://issues.apache.org/jira/browse/CASSANDRA-2619 - (0.8) secondary index not dropped until restart
https://issues.apache.org/jira/browse/CASSANDRA-2628 - (0.8) Empty Result with Secondary Index Queries with "limit 1"
https://issues.apache.org/jira/browse/CASSANDRA-3057 - (0.8) secondary index on a column that has a value of size > 64k will fail on flush
https://issues.apache.org/jira/browse/CASSANDRA-3540 - (1.0) Wrong check of partitioner for secondary indexes
https://issues.apache.org/jira/browse/CASSANDRA-3545 - (1.1) Fix very low Secondary Index performance
https://issues.apache.org/jira/browse/CASSANDRA-4257 - (1.1) CQL3 range query with secondary index fails
https://issues.apache.org/jira/browse/CASSANDRA-2897 - (1.2) Secondary indexes without read-before-write
https://issues.apache.org/jira/browse/CASSANDRA-4289 - (1.2) Secondary Indexes fail following a system restart
https://issues.apache.org/jira/browse/CASSANDRA-4785 - (1.2) Secondary Index Sporadically Doesn't Return Rows
https://issues.apache.org/jira/browse/CASSANDRA-4973 - (1.1) Secondary Index stops returning rows when caching=ALL
https://issues.apache.org/jira/browse/CASSANDRA-5079 - (1.1, but since 0.8) Compaction deletes ExpiringColumns in Secondary Indexes
https://issues.apache.org/jira/browse/CASSANDRA-5732 - (1.2/2.0) Can not query secondary index
https://issues.apache.org/jira/browse/CASSANDRA-5540 - (1.2) Concurrent secondary index updates remove rows from the index
https://issues.apache.org/jira/browse/CASSANDRA-5599 - (1.2) Intermittently, CQL SELECT with WHERE on secondary indexed field value returns null when there are rows
https://issues.apache.org/jira/browse/CASSANDRA-5397 - (1.2) Updates to PerRowSecondaryIndex don't use most current values
https://issues.apache.org/jira/browse/CASSANDRA-5161 - (1.2) Slow secondary index performance when using VNodes
https://issues.apache.org/jira/browse/CASSANDRA-5851 - (2.0) Fix 2i on composite components omissions
https://issues.apache.org/jira/browse/CASSANDRA-5614 - (2.0) W/O specified columns ASPCSI does not get notified of deletes
https://issues.apache.org/jira/browse/CASSANDRA-5920 - (2.0) Allow secondary indexed columns to be used with IN operator
https://issues.apache.org/jira/browse/CASSANDRA-5975 - (1.2/2.0) Filtering on Secondary Index Takes a Long Time Even with Limit 1, Trace Log Filled with Looping Messages
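
For readers unsure what a "manual secondary index column family" from the
stock answer looks like in practice, here is a minimal in-memory sketch in
Python. It is not Cassandra code: plain dicts stand in for two column
families, and every name in it (users, users_by_city, upsert_user,
find_by_city) is hypothetical. It only illustrates the shape of the idea:
the application writes the base row and the index entry itself, and a
lookup reads exactly one index partition instead of scanning many.

```python
# In-memory stand-ins for two column families (hypothetical names).
users = {}          # partition key: user_id -> row
users_by_city = {}  # partition key: city -> set of user_ids (the manual index)


def upsert_user(user_id, name, city):
    """Write the base row, then maintain the index entry by hand.

    The two writes are separate steps, so they are not atomic. That is
    the feature given up by not using a built-in secondary index.
    """
    # Reading the old row to find the stale index entry is a
    # simplification made by this sketch.
    old = users.get(user_id)
    users[user_id] = {"name": name, "city": city}

    # Drop the stale index entry if the indexed value changed.
    if old is not None and old["city"] != city:
        stale = users_by_city.get(old["city"], set())
        stale.discard(user_id)
        if not stale:
            users_by_city.pop(old["city"], None)

    users_by_city.setdefault(city, set()).add(user_id)


def find_by_city(city):
    """Read one index partition, then fetch each referenced row by key."""
    ids = sorted(users_by_city.get(city, ()))
    return [{**users[uid], "user_id": uid} for uid in ids]
```

Calling upsert_user(1, "ann", "Oslo") and then find_by_city("Oslo") returns
the one matching row. Against a real cluster the per-row fetches in
find_by_city would be the kind of per-partition reads the summary suggests
issuing with executeAsync().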