Hi Martin,

Yes, that was helpful, thanks.
(I had no idea you were reading the Cassandra users list! :-) )

Thanks,
(Kaj) Magnus L

On Mon, Sep 5, 2011 at 10:57 PM, Martin von Zweigbergk
<martin.von.zweigbe...@gmail.com> wrote:
> Hi Magnus,
>
> I think the answer might be on
> https://issues.apache.org/jira/browse/CASSANDRA-749. For example,
> Jonathan writes:
>
> <quote>
>> Is it worth creating a secondary index that only contains local data, versus
>> a distributed secondary index (a normal ColumnFamily?)
>
> I think my initial reasoning was wrong here. I was anti-local-indexes
> because "we have to query the full cluster for any index lookup, since
> we are throwing away our usual partitioning scheme."
>
> Which is true, but it ignores the fact that, in most cases, you will
> have to "query the full cluster" to get the actual matching rows, b/c
> the indexed rows will be spread across all machines. So, having local
> indexes is better in the common case, since it actually saves a round
> trip from querying the index to querying the rows.
>
> Also, having each node index the rows it has locally means you don't
> have to worry about sharding a very large index since it happens
> automatically.
>
> Finally, it lets us use the local commitlog to keep index + data in sync.
> </quote>
>
> Hope that helps,
> Martin
>
> On Mon, Sep 5, 2011 at 1:52 AM, Kaj Magnus Lindberg
> <kajmagnu...@gmail.com> wrote:
>> Hi,
>>
>> (This is the 2nd time I'm sending this message. I sent it the first
>> time a few days ago but it does not appear in the archives.)
>>
>> I have a follow-up question on a question from February 2011. In
>> short, I wonder why one won't have to query all Cassandra nodes when
>> doing a secondary index lookup -- although each node only indexes data
>> that it holds locally.
>>
>> The question and answer were:
>> ( http://www.mail-archive.com/user@cassandra.apache.org/msg10506.html )
>> === Question ===
>> As far as I understand automatic secondary indexes are generated for
>> node local data.
>> In this case query by secondary index involve all nodes storing part of
>> column family to get results (?) so (if i am right) if data is spread across
>> 50 nodes then 50 nodes are involved in single query?
>> [...]
>> === Answer ===
>> In practice, local secondary indexes scale to {RF * the limit of a single
>> machine} for -low cardinality- values (ex: users living in a certain state)
>> since the first node is likely to be able to answer your question. This also
>> means they are good for performing filtering for analytics.
>> [...]
>>
>> === Now I wonder ===
>> Why would the first node be likely to be able to answer the question?
>> It stores only index entries for users on that particular machine,
>> (says http://wiki.apache.org/cassandra/SecondaryIndexes:
>> "Each node only indexes data that it holds locally" )
>> but users might be stored by user name? And would thus be stored on
>> many different machines? Even if they happen to live in the same
>> state?
>>
>> Why won't the client need to query the indexes of [all servers that
>> store info on users] to find all relevant users, when doing a user
>> property lookup?
>>
>>
>> Best regards, KajMagnus
>>
>
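
The question quoted above (why the first node is likely to be able to answer a lookup on a low-cardinality value, even though each node indexes only its local rows) can be illustrated with a toy model. This is a rough sketch, not Cassandra code: the `Node` class, the `node_for` partitioner, and `indexed_query` are hypothetical names, replication is ignored, and it assumes the lookup asks nodes one at a time and stops as soon as a requested row limit is met.

```python
import hashlib
from collections import defaultdict


class Node:
    """Toy node: stores rows and indexes only the rows it holds locally."""

    def __init__(self):
        self.rows = {}                  # row key -> indexed column value
        self.index = defaultdict(set)   # indexed value -> local row keys

    def put(self, key, state):
        self.rows[key] = state
        self.index[state].add(key)

    def lookup(self, state):
        return sorted(self.index.get(state, ()))


def node_for(key, num_nodes):
    """Hypothetical partitioner: hash the row key to pick a node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def build_cluster(num_nodes, users):
    nodes = [Node() for _ in range(num_nodes)]
    for name, state in users.items():
        nodes[node_for(name, num_nodes)].put(name, state)
    return nodes


def indexed_query(nodes, state, limit):
    """Ask nodes one at a time; stop once `limit` matches are collected."""
    matches, contacted = [], 0
    for node in nodes:
        contacted += 1
        matches.extend(node.lookup(state))
        if len(matches) >= limit:
            break
    return matches[:limit], contacted


# 10,000 users partitioned by name over 50 nodes; "TX" and "CA" are common
# values (low cardinality), "WY" occurs exactly once.
users = {f"user{i}": ("TX" if i % 2 == 0 else "CA") for i in range(10000)}
users["rare_user"] = "WY"
nodes = build_cluster(50, users)

common_hits, common_contacted = indexed_query(nodes, "TX", limit=10)
rare_hits, rare_contacted = indexed_query(nodes, "WY", limit=10)
print(len(common_hits), common_contacted)
print(rare_hits, rare_contacted)
```

In this model, users living in the same state are indeed spread over all machines, which is exactly why every node's local index holds many matches for a common value and a query with a row limit can stop early. A query for a rare value, or one that wants every matching row, still has to visit all nodes.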