Re: Bitmap indexes - reviving CASSANDRA-1472
@Jason,

I have a lot of experience with SOLR + ES, but mainly for search (i.e., finding the most relevant records given a query). That's been working well, but now we have requirements to support dashboards. Those dashboards have aggregations in them (sum, average, counts, etc.). I have limited experience using filter functions and facets to achieve similar things with Lucene, but they never seemed to perform well when the sets were large.

If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use it instead. (Let me know!)

When we looked around, Druid seemed to fit the bill exactly (and it was open source):
http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/

BTW, here is more information on the compression that Druid uses:
http://metamarkets.com/2012/druid-bitmap-compression/

To echo Matt's sentiment, we'd love to leverage a C* native capability for this. (Acunu provides most of the capability, but it isn't open source.) I think once we have the conditional write semantics that are coming, we could layer this on top of C* (extending the secondary index functionality).

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
@boneill42 http://www.twitter.com/boneill42
healthmarketscience.com
On 4/12/13 12:46 AM, Matt Stump mrevilgn...@gmail.com wrote:

You could embed Lucene, but then you pretty much have DSE search, and there are people on this list in a better position than I to describe the difficulty in making that scale. By rolling your own you get simplicity and control. If you use a uniform index size you can just assign chunks of it to the Cassandra ring, making it easy to distribute queries.

I think that using Lucene in this way would cause most of the benefit of the library to be lost, and add unnecessary complexity. If Lucene were easy, then I think given the team's experience with both Lucene and C* it would have been done already.

Sorry if it's a fuzzy answer, but I haven't run down every technical angle on the integration with C* yet. The idea was still very much at the "wouldn't it be very cool if this thing lived in Cassandra" stage. It would be the nail in the coffin for Impala, Redshift, et al.

On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What's the advantage over Lucene?

On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump mrevilgn...@gmail.com wrote:

Druid was our inspiration to layer bitmap indexes on top of Cassandra. Druid doesn't work for us because our data set is too large. We would need many hundreds of nodes just for the pre-processed data. What I envisioned was the ability to perform Druid-style queries (no aggregation) without the limitations imposed by having the entire dataset in memory. I primarily need to query whether a user performed some event, but I also intend to add trigram indexes for LIKE, ILIKE or possibly regex-style matching.

I wasn't aware of CONCISE, thanks for the pointer. We are currently evaluating FastBit, which is a very similar project:
https://sdm.lbl.gov/fastbit/

On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill b...@alumni.brown.edu wrote:

How does this compare with Druid?
https://github.com/metamx/druid

We're currently evaluating Acunu, Vertica and Druid...
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html

With its bitmapped indexes, Druid appears to have the most potential. They boast some pretty impressive stats, especially WRT handling real-time updates and adding new dimensions. They also use a compression algorithm, CONCISE, to cut down on the space requirements.
http://ricerca.mat.uniroma3.it/users/colanton/concise.html

I haven't looked too deep into the Druid code, but I've been meaning to see if it could be backed by C*. We'd be game to join the hunt if you pursue such a beast (with your code, or with portions of Druid).

-brian

On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:

What do you think about set manipulation via indexes in Cassandra? I'm interested in answering queries such as "give me all users that performed event 1, 2, and 3, but not 4". If the
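The trigram indexes mentioned in this exchange for LIKE/ILIKE-style matching can be illustrated with a small in-memory sketch. This is not code from the thread or from any Cassandra patch; the class and method names (`TrigramIndex`, `add`, `like`) are hypothetical, and plain Python sets stand in for the bitmaps a real index would use. The idea is that every row matching a substring must contain all of that substring's trigrams, so intersecting the posting sets yields a small candidate list that is then verified.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text, lowercased."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


class TrigramIndex:
    """Toy inverted index from trigram to the set of row ids containing it (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.rows = {}

    def add(self, row_id, value):
        self.rows[row_id] = value
        for gram in trigrams(value):
            self.postings[gram].add(row_id)

    def like(self, needle):
        """Case-insensitive substring match, similar in spirit to ILIKE '%needle%'."""
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: nothing to filter on, so scan all rows.
            candidates = set(self.rows)
        else:
            # A matching row must contain every trigram of the needle.
            candidates = set.intersection(*(self.postings.get(g, set()) for g in grams))
        # Trigram overlap is necessary but not sufficient, so verify each candidate.
        return sorted(r for r in candidates if needle.lower() in self.rows[r].lower())
```

The final verification pass matters: two rows can share all the trigrams of a needle without containing the needle itself, so the index only narrows the search.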
Re: Bitmap indexes - reviving CASSANDRA-1472
Brian,

The Solr StatsComponent performs aggregations:
http://wiki.apache.org/solr/StatsComponent

I recommend using DataStax DSE Search...

On Fri, Apr 12, 2013 at 10:09 AM, Brian O'Neill b...@alumni.brown.edu wrote:

@Jason, I have a lot of experience with SOLR + ES, but mainly for search (i.e., finding the most relevant records given a query). That's been working well, but now we have requirements to support dashboards. Those dashboards have aggregations in them (sum, average, counts, etc.). I have limited experience using filter functions and facets to achieve similar things with Lucene, but they never seemed to perform well when the sets were large. If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use it instead. (Let me know!)
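As a rough illustration of the StatsComponent suggestion, the sketch below builds a Solr select URL that requests aggregates (min, max, sum, count, mean and so on) for one numeric field. The `stats`, `stats.field` and `stats.facet` parameters are the ones described on the wiki page linked above; the host, core name (`events`) and field names (`amount`, `region`) are made up for the example, and the helper `stats_query_url` is hypothetical.

```python
from urllib.parse import urlencode


def stats_query_url(base_url, query, field, facet=None):
    """Build a Solr select URL asking the StatsComponent for aggregates on one field."""
    params = [
        ("q", query),
        ("rows", 0),              # only the aggregates are wanted, not the documents
        ("stats", "true"),
        ("stats.field", field),
    ]
    if facet is not None:
        # Optionally break the statistics down per value of another field.
        params.append(("stats.facet", facet))
    return base_url + "/select?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
url = stats_query_url("http://localhost:8983/solr/events", "type:purchase", "amount", "region")
```

Whether this performs well enough for dashboards over large sets is exactly the open question raised earlier in the thread; the sketch only shows the shape of the request.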
Re: Bitmap indexes - reviving CASSANDRA-1472
Something like this?

    SELECT * FROM users
    WHERE user_id IN (select user_id from events where type in (1, 2, 3))
    AND user_id NOT IN (select user_id from events where type=4)

This doesn't really look like a Cassandra query to me. More like a query for Hive (or Drill, or Impala).

But, I know Sylvain is looking forward to adding index support to Collections [1], so something like this might fit:

    SELECT * FROM users
    WHERE (events CONTAINS 1 OR events CONTAINS 2 OR events CONTAINS 3)
    AND NOT (events CONTAINS 4)

However, even this is more than our current query planner can handle; we don't really handle disjunctions at all, except for the special case of IN on the partition key (which translates to multiget), let alone arbitrary logical predicates.

I think that between bitmap indexes and query planning, the latter is actually the hard part. QueryProcessor is about at the limits of tractable complexity already; I think we'd need a new approach if we want to handle arbitrarily complex predicates like that.

[1] https://issues.apache.org/jira/browse/CASSANDRA-4511

On Wed, Apr 10, 2013 at 4:40 PM, mrevilgnome mrevilgn...@gmail.com wrote:

What do you think about set manipulation via indexes in Cassandra? I'm interested in answering queries such as "give me all users that performed event 1, 2, and 3, but not 4". If the answer is yes, then I can make a case for spending my time on C*. The only downside for us would be that our current prototype is in C++, so we would lose some performance and the ability to dedicate an entire machine to caching/performing queries.

On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis jbel...@gmail.com wrote:

If you mean, "Can someone help me figure out how to get started updating these old patches to trunk and cleaning out the Avro?" then yes, I've been knee-deep in indexing code recently.
On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome mrevilgn...@gmail.com wrote:

I'm currently building a distributed cluster on top of Cassandra to perform fast set manipulation via bitmap indexes. This gives me the ability to perform unions, intersections, and set subtraction across sub-queries. Currently I'm storing index information for thousands of dimensions as Cassandra rows, and my cluster keeps this information cached, distributed and replicated in order to answer queries.

Every couple of days I think to myself that this should really exist in C*. Given all the benefits, would there be any interest in reviving CASSANDRA-1472?

Some downsides are that this is very memory intensive, even for sparse bitmaps.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
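The set manipulation described in this message (unions, intersections, and subtraction over per-event bitmaps) can be sketched in a few lines. This is only an illustration under stated assumptions, not the C++ prototype or the CASSANDRA-1472 patch: user ids are assumed to be small dense integers, a plain Python integer serves as an uncompressed bitmap, and the names `BitmapIndex`, `to_ids` and `performed_all_but` are hypothetical. The query reads "event 1, 2, and 3, but not 4" as an intersection followed by a subtraction.

```python
class BitmapIndex:
    """Toy bitmap index: one Python int per event type, bit n set if user n performed it."""

    def __init__(self):
        self.bitmaps = {}

    def record(self, user_id, event_type):
        self.bitmaps[event_type] = self.bitmaps.get(event_type, 0) | (1 << user_id)

    def users(self, event_type):
        return self.bitmaps.get(event_type, 0)


def to_ids(bitmap):
    """Expand a bitmap into a sorted list of the user ids whose bits are set."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


def performed_all_but(index, required, excluded):
    """Users that performed every event in `required` and none in `excluded`."""
    result = None
    for event_type in required:
        bits = index.users(event_type)
        result = bits if result is None else result & bits  # intersection
    if result is None:
        return 0
    for event_type in excluded:
        result &= ~index.users(event_type)                   # set subtraction
    return result
```

A union is the same pattern with `|` in place of `&`. The memory concern raised above applies directly: an uncompressed bitmap costs one bit per user per event type no matter how sparse it is, which is what schemes such as CONCISE aim to reduce.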
Re: Bitmap indexes - reviving CASSANDRA-1472
I am not sure about the collection case. But for compact storage you can specify multiple ranges in a slice query.
https://issues.apache.org/jira/browse/CASSANDRA-3885

I am not sure this will get you all the way to bitmap indexes, but in a wide-row scenario it seems like you could support an "event contains 1 or event contains 2 or event contains 3" query.

I am not sure how arbitrarily complex the CQL query handler can/will become. For intravert (something I am dabbling with) the concept is to apply a server-side function to the result of a slice.
https://github.com/zznate/intravert-ug/wiki/Filter-mode

There is a huge win in having multiple indexes behind the pluggable index support, but not all of the pluggable indexes and query options will be easy to CQL-ify.

On Fri, Apr 12, 2013 at 10:52 AM, Jonathan Ellis jbel...@gmail.com wrote:

But, I know Sylvain is looking forward to adding index support to Collections [1], so something like this might fit:

    SELECT * FROM users
    WHERE (events CONTAINS 1 OR events CONTAINS 2 OR events CONTAINS 3)
    AND NOT (events CONTAINS 4)

However, even this is more than our current query planner can handle; we don't really handle disjunctions at all, except for the special case of IN on the partition key (which translates to multiget), let alone arbitrary logical predicates.

I think that between bitmap indexes and query planning, the latter is actually the hard part. QueryProcessor is about at the limits of tractable complexity already; I think we'd need a new approach if we want to handle arbitrarily complex predicates like that.

[1] https://issues.apache.org/jira/browse/CASSANDRA-4511
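The combination described in this reply, reading several column ranges from one wide row and then applying a server-side function to the result, can be modeled with a toy in-memory row. Nothing here is the Cassandra or Intravert API: `WideRow`, `slice` and `multi_slice` are hypothetical names, column names are assumed to be `(event_type, timestamp)` tuples kept in sorted order, and ranges are treated as half-open and non-overlapping to keep the sketch short.

```python
from bisect import bisect_left


class WideRow:
    """Toy wide row: column names kept sorted, values looked up by name (hypothetical)."""

    def __init__(self, columns):
        self.names = sorted(columns)
        self.values = dict(columns)

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name < end."""
        lo = bisect_left(self.names, start)
        hi = bisect_left(self.names, end)
        return [(n, self.values[n]) for n in self.names[lo:hi]]

    def multi_slice(self, ranges, row_filter=None):
        """Read several ranges in one call, then apply an optional filter to the result."""
        result = []
        for start, end in ranges:
            result.extend(self.slice(start, end))
        if row_filter is not None:
            # Stand-in for a server-side function applied to the sliced columns.
            result = [col for col in result if row_filter(col)]
        return result


row = WideRow({(1, 100): "a", (1, 200): "b", (2, 150): "c", (3, 120): "d", (4, 130): "e"})
# "event 1 or event 3": two ranges in a single read.
events_1_or_3 = row.multi_slice([((1,), (2,)), ((3,), (4,))])
# Same read, keeping only columns with timestamp greater than 110.
recent = row.multi_slice([((1,), (2,)), ((3,), (4,))], lambda col: col[0][1] > 110)
```

This covers the disjunction case with one read per row, but the "AND NOT" part of the query would still need either a second read or a filter that sees the excluded columns, which is where a bitmap index or a richer query planner would come in.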