Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-12 Thread Brian O'Neill
@Jason,

I have a lot of experience with SOLR + ES, but mainly for search.  (i.e.
Finding the most relevant records given a query)
That's been working well, but now we have requirements to support
dashboards.  Those dashboards have aggregations in them (sum, average,
count(s), etc).  I have limited experience using filter functions and
facets to achieve similar things w/ Lucene, but they never seemed to
perform well when the sets were large.

If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
it instead. (Let me know!)

When we looked around, Druid seemed to fit the bill exactly (and it was
open source):
http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/

BTW, here is more information on the compression that Druid uses:
http://metamarkets.com/2012/druid-bitmap-compression/


To echo Matt's sentiment, we'd love to leverage a C* native capability for
this.
(Acunu provides most of the capability, but it isn't open source)

I think once we have the conditional write semantics that are coming, we
could layer this on top of C*. (extending the secondary indexes
functionality)

-brian



---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or
the person responsible to deliver it to the intended recipient, please
contact the sender at the email above and delete this email and any
attachments and destroy any copies thereof. Any review, retransmission,
dissemination, copying or other use of, or taking any action in reliance
upon, this information by persons or entities other than the intended
recipient is strictly prohibited.
 






On 4/12/13 12:46 AM, Matt Stump mrevilgn...@gmail.com wrote:

You could embed Lucene, but then you pretty much have DSE search, and there
are people on this list in a better position than I to describe the
difficulty in making that scale. By rolling your own you get simplicity and
control. If you use a uniform index size you can just assign chunks of it to
the Cassandra ring, making it easy to distribute queries. I think that using
Lucene in this way would cause most of the benefit of the library to be
lost, and add unnecessary complexity. If Lucene were easy, then I think,
given the team's experience with both Lucene and C*, it would have been done
already.
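
To make the "uniform index size" point concrete, here is a minimal Python sketch of one way it could work. Everything in it is an assumption for illustration: the chunk size, the node names, the row-key format, and the use of an MD5 hash modulo the node count as a stand-in for real token ownership. It is not how Cassandra's partitioner assigns ranges.

```python
import hashlib

CHUNK_BITS = 1024  # assumed uniform chunk size, in bits
NODES = ["node-a", "node-b", "node-c"]  # hypothetical ring members


def chunk_key(dimension, user_id):
    """Row key of the chunk holding user_id's bit for a dimension."""
    return f"{dimension}:{user_id // CHUNK_BITS}"


def owner(key, nodes=NODES):
    """Map a chunk key to a node by hashing (a stand-in for token ownership)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]


def chunks_for_range(dimension, first_id, last_id):
    """All chunk keys a query over ids first_id..last_id (inclusive) must read."""
    lo, hi = first_id // CHUNK_BITS, last_id // CHUNK_BITS
    return [f"{dimension}:{n}" for n in range(lo, hi + 1)]
```

Because every chunk covers the same number of ids, a query can work out which chunk keys it needs with plain arithmetic and send each sub-query to the node that owns that key.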

Sorry if it's a fuzzy answer, but I haven't run down every technical angle
on the integration with C* yet. The idea was still very much at the
"wouldn't it be very cool if this thing lived in Cassandra" stage. It would
be the nail in the coffin for Impala, Redshift, et al.


On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 What's the advantage over Lucene?


 On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump mrevilgn...@gmail.com
 wrote:

  Druid was our inspiration to layer bitmap indexes on top of Cassandra.
  Druid doesn't work for us because our data set is too large. We would need
  many hundreds of nodes just for the pre-processed data. What I envisioned
  was the ability to perform Druid-style queries (no aggregation) without
  the limitations imposed by having the entire dataset in memory. I
  primarily need to query whether a user performed some event, but I also
  intend to add trigram indexes for LIKE, ILIKE or possibly regex-style
  matching.
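
For readers unfamiliar with trigram indexes, the following Python sketch shows the general idea behind using them for ILIKE-style substring matching. The class and method names are hypothetical and nothing here comes from the prototype discussed in this thread: each string is broken into 3-character grams, the posting sets for the grams of the search term are intersected to get candidates, and the candidates are then verified against the stored text.

```python
from collections import defaultdict


def trigrams(text):
    """Set of lower-cased 3-character substrings of text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of docs containing it
        self.docs = {}                    # id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search_substring(self, needle):
        """Case-insensitive substring match, similar to ILIKE '%needle%'."""
        grams = trigrams(needle)
        if grams:
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            candidates = set(self.docs)  # needle too short: fall back to a scan
        n = needle.lower()
        # The index only narrows the candidates; verify each one.
        return sorted(d for d in candidates if n in self.docs[d].lower())


index = TrigramIndex()
index.add(1, "Cassandra")
index.add(2, "Sandra Bullock")
index.add(3, "Druid")
```

With this data, `index.search_substring("SANDRA")` returns ids 1 and 2. The verification step matters because sharing all trigrams does not guarantee that they appear contiguously in the candidate.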
 
  I wasn't aware of CONCISE, thanks for the pointer. We are currently
  evaluating fastbit, which is a very similar project:
  https://sdm.lbl.gov/fastbit/
 
 
  On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill b...@alumni.brown.edu
  wrote:
 
  
   How does this compare with Druid?
   https://github.com/metamx/druid
  
   We're currently evaluating Acunu, Vertica and Druid...
   http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
  
   With its bitmapped indexes, Druid appears to have the most potential.
   They boast some pretty impressive stats, especially WRT handling
   real-time updates and adding new dimensions.
  
   They also use a compression algorithm, CONCISE, to cut down on the space
   requirements.
   http://ricerca.mat.uniroma3.it/users/colanton/concise.html
  
   I haven't looked too deep into the Druid code, but I've been meaning to
   see if it could be backed by C*.
  
   We'd be game to join the hunt if you pursue such a beast. (with your code,
   or with portions of Druid)
  
   -brian
  
  
   On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:
  
What do you think about set manipulation via indexes in Cassandra? I'm
interested in answering queries such as give me all users that performed
event 1, 2, and 3, but not 4. If the 

Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-12 Thread Jason Rutherglen
Brian,

The Solr StatsComponent performs aggregations.

http://wiki.apache.org/solr/StatsComponent

I recommend using Datastax DSE Search...


On Fri, Apr 12, 2013 at 10:09 AM, Brian O'Neill b...@alumni.brown.edu wrote:

 @Jason,

 I have a lot of experience with SOLR + ES, but mainly for search.  (i.e.
 Finding the most relevant records given a query)
 That's been working well, but now we have requirements to support
 dashboards.  Those dashboards have aggregations in them (sum, average,
 count(s), etc).  I have limited experience using filter functions and
 facets to achieve similar things w/ Lucene, but they never seemed to
 perform well when the sets were large.

 If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
 it instead. (Let me know!)

 When we looked around, Druid seemed to fit the bill exactly (and it was
 open source):
 http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/

 BTW, here is more information on the compression that Druid uses:
 http://metamarkets.com/2012/druid-bitmap-compression/


 To echo Matt's sentiment, we'd love to leverage a C* native capability for
 this.
 (Acunu provides most of the capability, but it isn't open source)

 I think once we have the conditional write semantics that are coming, we
 could layer this on top of C*. (extending the secondary indexes
 functionality)

 -brian










 On 4/12/13 12:46 AM, Matt Stump mrevilgn...@gmail.com wrote:

 You could embed Lucene, but then you pretty much have DSE search, and there
 are people on this list in a better position than I to describe the
 difficulty in making that scale. By rolling your own you get simplicity and
 control. If you use a uniform index size you can just assign chunks of it to
 the Cassandra ring, making it easy to distribute queries. I think that using
 Lucene in this way would cause most of the benefit of the library to be
 lost, and add unnecessary complexity. If Lucene were easy, then I think,
 given the team's experience with both Lucene and C*, it would have been done
 already.
 
 Sorry if it's a fuzzy answer, but I haven't run down every technical angle
 on the integration with C* yet. The idea was still very much at the
 "wouldn't it be very cool if this thing lived in Cassandra" stage. It would
 be the nail in the coffin for Impala, Redshift, et al.
 
 
 On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:
 
  What's the advantage over Lucene?
 
 
  On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump mrevilgn...@gmail.com
  wrote:
 
   Druid was our inspiration to layer bitmap indexes on top of Cassandra.
   Druid doesn't work for us because our data set is too large. We would need
   many hundreds of nodes just for the pre-processed data. What I envisioned
   was the ability to perform Druid-style queries (no aggregation) without
   the limitations imposed by having the entire dataset in memory. I
   primarily need to query whether a user performed some event, but I also
   intend to add trigram indexes for LIKE, ILIKE or possibly regex-style
   matching.
  
   I wasn't aware of CONCISE, thanks for the pointer. We are currently
   evaluating fastbit, which is a very similar project:
   https://sdm.lbl.gov/fastbit/
  
  
   On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill b...@alumni.brown.edu
   wrote:
  
   
How does this compare with Druid?
https://github.com/metamx/druid
   
We're currently evaluating Acunu, Vertica and Druid...
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
   
With its bitmapped indexes, Druid appears to have the most potential.
They boast some pretty impressive stats, especially WRT handling
real-time updates and adding new dimensions.
   
They also use a compression algorithm, CONCISE, to cut down on the space
requirements.
http://ricerca.mat.uniroma3.it/users/colanton/concise.html
   
I haven't looked too deep into the Druid code, but I've been meaning to
see if it could be backed by C*.
   
We'd be game to join the hunt if you 

Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-12 Thread Jonathan Ellis
Something like this?

SELECT * FROM users
WHERE user_id IN (select user_id from events where type in (1, 2, 3))
  AND user_id NOT IN (select user_id from events where type=4)

This doesn't really look like a Cassandra query to me.  More like a
query for Hive (or Drill, or Impala).

But, I know Sylvain is looking forward to adding index support to
Collections [1], so something like this might fit:

SELECT * FROM users
WHERE (events CONTAINS 1 OR events CONTAINS 2 OR events CONTAINS 3)
   AND NOT (events CONTAINS 4)

However, even this is more than our current query planner can handle;
we don't really handle disjunctions at all, except for the special
case of IN on the partition key (which translates to multiget), let
alone arbitrary logical predicates.

I think that between bitmap indexes and query planning, the latter
is actually the hard part.  QueryProcessor is about at the limits of
tractable complexity already; I think we'd need a new approach if we
want to handle arbitrarily complex predicates like that.
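
As an illustration of why the index side is the easier half, here is a small Python sketch of how a bitmap index would evaluate both readings of the request once a plan exists. The data is made up, plain Python integers stand in for uncompressed bitmaps (bit i set means user i performed the event), and the helper names are hypothetical; nothing here reflects Cassandra internals.

```python
def bitmap(user_ids):
    """Build a bitmap (as an int) with one bit set per user id."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits


def members(bits):
    """Return the sorted user ids whose bit is set."""
    out = []
    uid = 0
    while bits:
        if bits & 1:
            out.append(uid)
        bits >>= 1
        uid += 1
    return out


# Hypothetical index: event type -> bitmap of users who performed it.
events = {
    1: bitmap([0, 1, 2, 5]),
    2: bitmap([1, 2, 3, 5]),
    3: bitmap([1, 2, 5, 7]),
    4: bitmap([2, 9]),
}

# (1 OR 2 OR 3) AND NOT 4: the shape of the CONTAINS query above.
any_of = (events[1] | events[2] | events[3]) & ~events[4]

# (1 AND 2 AND 3) AND NOT 4: "performed events 1, 2, and 3, but not 4".
all_of = (events[1] & events[2] & events[3]) & ~events[4]
```

Each logical operator maps to one bitwise operation over the per-event bitmaps, so the evaluation itself is mechanical. Deciding which bitmaps to fetch, from which nodes, and in what order is the planning problem described above.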

[1] https://issues.apache.org/jira/browse/CASSANDRA-4511


On Wed, Apr 10, 2013 at 4:40 PM, mrevilgnome mrevilgn...@gmail.com wrote:
 What do you think about set manipulation via indexes in Cassandra? I'm
 interested in answering queries such as "give me all users that performed
 event 1, 2, and 3, but not 4". If the answer is yes then I can make a case
 for spending my time on C*. The only downside for us would be that our
 current prototype is in C++, so we would lose some performance and the
 ability to dedicate an entire machine to caching/performing queries.


 On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis jbel...@gmail.com wrote:

 If you mean, "Can someone help me figure out how to get started updating
 these old patches to trunk and cleaning out the Avro?" then yes, I've been
 knee-deep in indexing code recently.


 On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome mrevilgn...@gmail.com
 wrote:

  I'm currently building a distributed cluster on top of Cassandra to perform
  fast set manipulation via bitmap indexes. This gives me the ability to
  perform unions, intersections, and set subtraction across sub-queries.
  Currently I'm storing index information for thousands of dimensions as
  Cassandra rows, and my cluster keeps this information cached, distributed
  and replicated in order to answer queries.
 
  Every couple of days I think to myself this should really exist in C*.
  Given all the benefits, would there be any interest in
  reviving CASSANDRA-1472?
 
  One downside is that this is very memory intensive, even for sparse
  bitmaps.
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder, http://www.datastax.com
 @spyced




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced


Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-12 Thread Edward Capriolo
I am not sure about the collection case. But for compact storage you can
specify multiple ranges in a slice query.

https://issues.apache.org/jira/browse/CASSANDRA-3885

I am not sure this will get you all the way to bitmap indexes, but in a
wide-row scenario it seems like you could support "event contains 1 or
event contains 2 or event contains 3".

I am not sure how arbitrarily complex the CQL query handler can/will
become. For intravert (something I am dabbling with) the concept is to apply
a server-side function to the result of a slice.

https://github.com/zznate/intravert-ug/wiki/Filter-mode

There is a huge win in having multiple indexes behind the pluggable index
support, but not all of the pluggable indexes and query options will be easy
to CQL-ify.




On Fri, Apr 12, 2013 at 10:52 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Something like this?

 SELECT * FROM users
 WHERE user_id IN (select user_id from events where type in (1, 2, 3))
   AND user_id NOT IN (select user_id from events where type=4)

 This doesn't really look like a Cassandra query to me.  More like a
 query for Hive (or Drill, or Impala).

 But, I know Sylvain is looking forward to adding index support to
 Collections [1], so something like this might fit:

 SELECT * FROM users
 WHERE (events CONTAINS 1 OR events CONTAINS 2 OR events CONTAINS 3)
AND NOT (events CONTAINS 4)

 However, even this is more than our current query planner can handle;
 we don't really handle disjunctions at all, except for the special
 case of IN on the partition key (which translates to multiget), let
 alone arbitrary logical predicates.

 I think that between bitmap indexes and query planning, the latter
 is actually the hard part.  QueryProcessor is about at the limits of
 tractable complexity already; I think we'd need a new approach if we
 want to handle arbitrarily complex predicates like that.

 [1] https://issues.apache.org/jira/browse/CASSANDRA-4511


 On Wed, Apr 10, 2013 at 4:40 PM, mrevilgnome mrevilgn...@gmail.com
 wrote:
  What do you think about set manipulation via indexes in Cassandra? I'm
  interested in answering queries such as "give me all users that performed
  event 1, 2, and 3, but not 4". If the answer is yes then I can make a case
  for spending my time on C*. The only downside for us would be that our
  current prototype is in C++, so we would lose some performance and the
  ability to dedicate an entire machine to caching/performing queries.
 
 
  On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  If you mean, "Can someone help me figure out how to get started updating
  these old patches to trunk and cleaning out the Avro?" then yes, I've
  been knee-deep in indexing code recently.
 
 
  On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome mrevilgn...@gmail.com
  wrote:
 
   I'm currently building a distributed cluster on top of Cassandra to
   perform fast set manipulation via bitmap indexes. This gives me the
   ability to perform unions, intersections, and set subtraction across
   sub-queries. Currently I'm storing index information for thousands of
   dimensions as Cassandra rows, and my cluster keeps this information
   cached, distributed and replicated in order to answer queries.
  
   Every couple of days I think to myself this should really exist in C*.
   Given all the benefits, would there be any interest in
   reviving CASSANDRA-1472?
  
   One downside is that this is very memory intensive, even for sparse
   bitmaps.
  
 
 
 