Hey y'all,

There've been a few questions about distributed database solutions (a partial list: HBase, Voldemort, Memcached, ThruDB, CouchDB, Ringo, Scalaris, Kai, Dynomite, Cassandra, Hypertable, as well as the closed Dynamo, BigTable, SimpleDB).
For someone using Hadoop at scale, which aspects of a problem would recommend one of these over another? And in your subjective judgment, do any of them seem especially likely to succeed?

Richard Jones of Last.fm just posted an overview with a great deal of engineering insight:
  http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
His focus is a production web server farm, and so in some ways orthogonal to the crowd here -- but still highly recommended.

Swaroop CH of Yahoo wrote a broad introduction to distributed DBs that I also found useful:
  http://www.swaroopch.com/notes/Distributed_Storage_Systems

Both give HBase short shrift, though my impression is that it is the leader among open projects for massive unordered dataset problems. Still, the answer doesn't seem to be a simple "If you're using Hadoop you should be using HBase, dummy."

I don't have the expertise to write this kind of overview from the Hadoop / big data perspective, but I would eagerly read such an article from someone who does, or a summary of the list's insights.
===

In lieu of such a summary for now, pointers to a few relevant threads:

* http://www.nabble.com/Why-is-scaling-HBase-much-simpler-then-scaling-a-relational-db--tt18869660.html#a19093685
  (especially Jonathan Gray's breakdown)
* "HBase Performance"
  http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02540.html
  (and the paper by Stonebraker and friends: http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf)
* http://www.nabble.com/Serving-contents-of-large-MapFiles-SequenceFiles-from-memory-across-many-machines-tt19546012.html#a19574917
* On specific problem domains:
  http://www.nabble.com/Indexed-Hashtables-tt21470024.html#a21470848
  http://www.nabble.com/Why-can%27t-Hadoop-be-used-for-online-applications---tt19461962.html#a19471894
  http://www.nabble.com/Architecture-question.-tt21100766.html#a21100766

flip

(Noted in passing: a huge proportion of the development seems to be coming out of commercial enterprises and not the academic/HPC community. I worry my ivory tower is hung up on big iron and the top500.org list, at the expense of solving the many interesting problems these tools unlock.)

--
http://www.infochimps.org
Connected Open Free Data