On the scalability and performance side, I found Yahoo's paper on the YCSB project interesting (it benchmarks some NoSQL solutions alongside MySQL). See research.yahoo.com/files/ycsb.pdf.
My concern with the denormalization approach is that it shouldn't be managed on the client side, because that has a big impact on your throughput. Is map-reduce any better in that respect? Wouldn't it be nice to support a kind of PL-(No)SQL server-side scripting that allows you to create and maintain materialized views? You could still offer the option of maintaining the view synchronously (an extension of the current row-level atomicity) or asynchronously. Not sure how complicated this support would be...

- David

On Mon, May 10, 2010 at 10:38 PM, Paul Prescod <p...@prescod.net> wrote:
> On Mon, May 10, 2010 at 1:23 PM, Peter Hsu <pe...@motivecast.com> wrote:
> > Thanks for the response, Paul.
> > ...
> >
> > * Cassandra and its siblings are weak at ad hoc queries on tables
> > that you did not think to index in advance
> >
> > What is the normal way of dealing with this in Cassandra? Would you just
> > create a new "index" and bring a big honking machine to the table to process
> > all the existing data in the database and store the new "index"?
>
> The latest version of Cassandra introduces a "map/reduce" paradigm
> which is the main tool you'd use for batch processing of data. You
> could either use that to DO your ad hoc query or to process the data
> into an index for more efficient ad hoc queries in the future.
>
> * http://en.wikipedia.org/wiki/MapReduce
>
> * http://en.wikipedia.org/wiki/Hadoop
>
> * http://architects.dzone.com/news/cassandra-adds-hadoop
>
> You can read criticisms of MapReduce in the first link there.
>
> > On May 10, 2010, at 11:22 AM, Paul Prescod wrote:
> >
> > This is a very, very big topic. For the most part, the issues are
> > covered in the various SQL versus NoSQL debates all over the Internet.
> > For example:
> >
> > * Cassandra and its NoSQL siblings have no concept of an in-database "join"
> >
> > * Cassandra and its NoSQL siblings do not allow you to update
> > multiple "tables" in a single transaction
> >
> > * Cassandra's API is specific to it, and not portable to any other data
> > store
> >
> > * Cassandra currently has simplistic facilities to deal with various
> > kinds of conflicting writes.
> >
> > * Cassandra is strongly optimized for multiple machine distributions,
> > whereas relational databases tend to be optimized for a single
> > powerful machine.
> >
> > * Cassandra and its siblings are weak at ad hoc queries on tables
> > that you did not think to index in advance
> >
> > On Mon, May 10, 2010 at 11:06 AM, Peter Hsu <pe...@motivecast.com> wrote:
> >
> > I've seen a lot of threads and posts about why Cassandra is great. I'm
> > fairly sold on the features, and the few big deployments on Cassandra give
> > it a lot of credibility.
> >
> > However, I don't believe in magic bullets, so I really want to understand
> > the potential downsides of Cassandra. Right now, I don't really have a clue
> > as to what Cassandra is bad at. I took a look at
> > http://wiki.apache.org/cassandra/CassandraLimitations which is helpful, but
> > doesn't characterize its weaknesses in ways that I can really comprehend
> > until I've actually used Cassandra and understand some of the internals. It
> > seems that the community would benefit from being able to answer some of
> > these questions in terms of real world use cases.
> >
> > My main questions:
> >
> > * Are there designs in which a SQL database out-performs or out-scales
> > Cassandra?
> >
> > * Is there a pros vs cons page of Cassandra against an open source SQL
> > database (MySQL or Postgres)?
> >
> > I do plan on attending the training session next Friday in Palo Alto, but
> > it'd be great if I had some more food for thought before I attend.