At Ooyala, we're in the process of testing and productionizing Cassandra to store and serve our near real-time video analytics data. Ooyala provides a comprehensive platform for professional video publishers and enterprise companies looking to build up their online video presence, and analytics/monetization is a key part of the platform.
We researched a variety of systems to replace our current MySQL solution, including HBase, Cassandra, Voldemort, and some others. Of those, we seriously considered HBase and Cassandra as satisfying our needs b/c of HA, scaling, and the more fully featured data schema, which is a better fit for our high dimensional data.

For both HBase and Cassandra, we designed data schemas, built functional prototypes of our application, conducted a fairly thorough performance evaluation, tested the two systems for various failure scenarios, and also evaluated how easy each system was to maintain and run.

What I'd like to see in Cassandra:
- More comments in the source code, esp. high-level descriptions of code organization. Design docs for various functionality would also be helpful in getting other folks to contribute. This was one area where HBase was significantly better.
- Better bootstrapping and load balancing support (bootstrapping seemed broken in 0.4.2), but I've seen a lot of work done in these two areas for 0.5.

Edmond

On Fri, Nov 20, 2009 at 3:02 PM, Tim Underwood <timunderw...@gmail.com> wrote:
> My company runs a niche comparison shopping site where we take in all sorts
> of raw product data from various sources (retailers, manufacturers,
> distributors, etc...). We then have to take all that raw data and collapse
> it down across the data sources (e.g. product FOO from source A matches
> product BAR from source B) and eventually end up with a final product that
> gets surfaced to our website.
>
> Cassandra's data model works great for the raw data where columns are
> sparsely populated and updated. The SuperColumnFamily model works great for
> my collapsed data where I need to track which bits of information came from
> which raw data.
>
> I'm currently in testing (almost production). For this use case I'll only
> be using Cassandra on the backend and then indexing the final data into
> Apache Solr to power the frontend. My data is small enough to fit on a
> single node so I don't have much use for the partitioning at this point. If
> anything I'd be more interested in a fully replicated setup where the
> ReplicationFactor is equal to the number of nodes.
>
> I looked at most of the other nosql solutions (couchdb, mongodb, hbase,
> hypertable, dynomite, voldemort).
>
> One thing I'd love to see improved:
> - Reading through all the data (or a specific key prefix) in a ColumnFamily
> seems slow. Cassandra is the bottleneck when I try to index data into Solr
> and it looks like Cassandra's CPU usage is 2-3 times that of Solr's during
> the process.
>
> I look forward to playing around with 0.5!
>
> -Tim
>
> On Fri, Nov 20, 2009 at 1:17 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I'd love to get a better feel for who is using Cassandra and what kind
>> of applications it is seeing. If you are using Cassandra, could you
>> share what you're using it for and what stage you are at with it
>> (evaluation / testing / production)? Also, what alternatives you
>> evaluated/are evaluating would be useful. Finally, feel free to throw
>> in "I'd love to use Cassandra if only it did X" wishes. :)
>>
>> I can start: Rackspace is using Cassandra for stats collection
>> (testing, almost production) and as a backend for the Mail & Apps
>> division (early testing). We evaluated HBase, Hypertable, dynomite,
>> and Voldemort as well.
>>
>> Thanks,
>>
>> -Jonathan
>>
>> (If you're in stealth mode or don't want to say anything in public,
>> feel free to reply to me privately and I will keep it off the record.)