On Fri, May 16, 2014 at 10:53 AM, Kevin Burton <bur...@spinn3r.com> wrote:

> I'm struggling with cassandra secondary indexes since the documentation
> seems all over the place and I'm having to put together everything from
> blog posts.
>

This mostly complete summary will eventually make it into a blog
post :

"
Secondary Indexes in Cassandra
------------------------------------------

Users frequently come into #cassandra or the cassandra-user@ mailing list
and ask questions about Secondary Indexes. Here is my stock answer.

“Unless you REALLY NEED the feature of atomic update of the secondary index
with the underlying row, you are almost always better off just making your
own manual secondary index column family.”

In Cassandra, the unit of distribution is the partition (f/k/a “Row”). If
your query needs to scan multiple partitions and inspect each of their
contents, you have probably made a mistake in your data model. For queries
which interact with sets of partitions one should use executeAsync() w/ the
new CQL drivers, not multigets.
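
As a rough illustration of the "manual secondary index column family" idea, here is a minimal Python sketch. It uses plain in-memory dicts rather than a real cluster or driver; the names users, users_by_city, write_user and users_in_city are all hypothetical, and the thread pool is only a stand-in for issuing one executeAsync() call per partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two column families.
users = {}          # partition key (user id) -> row
users_by_city = {}  # partition key (city)    -> set of user ids


def write_user(user_id, name, city):
    """Write the base row, then the index entry.

    These are two separate writes, so unlike a built-in secondary
    index they are not applied atomically.
    """
    old = users.get(user_id)
    if old is not None and old["city"] != city:
        # Removing the stale index entry requires knowing the old value.
        users_by_city[old["city"]].discard(user_id)
    users[user_id] = {"name": name, "city": city}
    users_by_city.setdefault(city, set()).add(user_id)


def read_user(user_id):
    """A single-partition read; stands in for one driver query."""
    return users.get(user_id)


def users_in_city(city):
    """Read one index partition, then one base partition per match.

    The per-partition reads are issued concurrently instead of as a
    single multiget.
    """
    ids = sorted(users_by_city.get(city, ()))
    with ThreadPoolExecutor(max_workers=8) as pool:
        rows = list(pool.map(read_user, ids))
    return [row for row in rows if row is not None]
```

Each lookup touches exactly one index partition plus the matching base partitions, so no query has to scan partitions and inspect their contents.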

Advantages of Secondary Indexes :

- Atomic update of secondary index with underlying partition/storage row.
- Don’t have to be maintained manually, including automated rebuild.
- Provides the illusion that you are using an RDBMS.

Disadvantages of Secondary Indexes :

- Before 1.2, they do a read-before-write.
https://issues.apache.org/jira/browse/CASSANDRA-2897
- A steady trickle of occasionally-serious bugs which do not affect the
normal read/write path. [1]
- Bad for low cardinality cases. FIXME : detail (relates to checking each
node)
- Bad for high cardinality cases. FIXME : detail (certain cases? what about
equality/non-equality?)
- Index column family stats not exposed via nodetool cfstats before 1.2 :
https://issues.apache.org/jira/browse/CASSANDRA-4464 ?
- Lower availability than normal Cassandra read path. FIXME : citation
- Results are unsorted: returned in token order, not in order of the
queried value.
- Can only search on datatypes Cassandra understands.
- Secondary index is located in the same directory as the primary SSTables.
- Provides the illusion that you are using an RDBMS.
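
To make the token-order point in the list above concrete, here is a small Python sketch. It assumes a toy token function (md5 of the partition key, used only as a stand-in for whatever hash the configured partitioner applies) and a hypothetical rows mapping. It does not talk to Cassandra.

```python
import hashlib


def token(partition_key):
    """Stand-in token: md5 of the key as an integer (an assumption,
    not the hash of any particular partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


# Hypothetical data: partition key -> value of the indexed column.
rows = {"u1": 42, "u2": 7, "u3": 19, "u4": 7}


def index_query_order(rows):
    """Order in which matching partitions would come back: by token."""
    return sorted(rows, key=token)


def value_order(rows):
    """Order a client usually wants: by indexed value, then by key."""
    return sorted(rows, key=lambda key: (rows[key], key))
```

The two orderings are unrelated, so a client that needs results sorted by the indexed value has to sort them itself after the query returns.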
"

Readers will note that I am not very clear above on which cardinality cases
they *are* good for, because I consider all the other problems sufficient
to never use them.

=Rob

[1] Citations :

https://issues.apache.org/jira/browse/CASSANDRA-5502

https://issues.apache.org/jira/browse/CASSANDRA-1571 - (0.7) Secondary
Indexes aren't updated when removing whole row

https://issues.apache.org/jira/browse/CASSANDRA-1747 - (0.7) Truncate is
not secondary index aware

https://issues.apache.org/jira/browse/CASSANDRA-1813 - (0.7) return
invalidrequest when client attempts to create secondary index on
supercolumns

https://issues.apache.org/jira/browse/CASSANDRA-2619 - (0.8) secondary
index not dropped until restart

https://issues.apache.org/jira/browse/CASSANDRA-2628 - (0.8) Empty Result
with Secondary Index Queries with "limit 1"

https://issues.apache.org/jira/browse/CASSANDRA-3057 - (0.8) secondary
index on a column that has a value of size > 64k will fail on flush

https://issues.apache.org/jira/browse/CASSANDRA-3540 - (1.0) Wrong check of
partitioner for secondary indexes

https://issues.apache.org/jira/browse/CASSANDRA-3545 - (1.1) Fix very low
Secondary Index performance

https://issues.apache.org/jira/browse/CASSANDRA-4257 - (1.1) CQL3 range
query with secondary index fails

https://issues.apache.org/jira/browse/CASSANDRA-2897 - (1.2) Secondary
indexes without read-before-write

https://issues.apache.org/jira/browse/CASSANDRA-4289 - (1.2) Secondary
Indexes fail following a system restart

https://issues.apache.org/jira/browse/CASSANDRA-4785 - (1.2) Secondary
Index Sporadically Doesn't Return Rows

https://issues.apache.org/jira/browse/CASSANDRA-4973 - (1.1) Secondary
Index stops returning rows when caching=ALL

https://issues.apache.org/jira/browse/CASSANDRA-5079 - (1.1, but since
0.8) Compaction deletes ExpiringColumns in Secondary Indexes

https://issues.apache.org/jira/browse/CASSANDRA-5732 - (1.2/2.0) Can not
query secondary index

https://issues.apache.org/jira/browse/CASSANDRA-5540 - (1.2) Concurrent
secondary index updates remove rows from the index

https://issues.apache.org/jira/browse/CASSANDRA-5599 - (1.2)
Intermittently, CQL SELECT with WHERE on secondary indexed field value
returns null when there are rows

https://issues.apache.org/jira/browse/CASSANDRA-5397 - (1.2) Updates to
PerRowSecondaryIndex don't use most current values

https://issues.apache.org/jira/browse/CASSANDRA-5161 - (1.2) Slow secondary
index performance when using VNodes

https://issues.apache.org/jira/browse/CASSANDRA-5851 - (2.0) Fix 2i on
composite components omissions

https://issues.apache.org/jira/browse/CASSANDRA-5614 - (2.0) W/O specified
columns ASPCSI does not get notified of deletes

https://issues.apache.org/jira/browse/CASSANDRA-5920 - (2.0) Allow
secondary indexed columns to be used with IN operator

https://issues.apache.org/jira/browse/CASSANDRA-5975 - (1.2/2.0) Filtering
on Secondary Index Takes a Long Time Even with Limit 1, Trace Log Filled
with Looping Messages
