Hi Chris,

Appreciate your answer on the post.

Personally speaking, however, the endless Cassandra vs. HBase discussion is tiresome, and rarely do blog posts or emails in this regard shed any light. Often, Cassandra proponents misstate their case out of ignorance of HBase or due to commercial or personal agendas. It is difficult to find clear-eyed analysis among the partisans. I'm not sure it will make any difference posting a rebuttal to some random thing jbellis says. Better to focus on improving HBase than play whack-a-mole.

Regarding some of the specific points in that post:

HBase is proven in production deployments larger than the largest publicly reported Cassandra cluster, ~1K nodes versus 400 or 700 or somesuch. But basically this is the same order of magnitude, with HBase having a slight edge. I don't see a meaningful difference here. Stating otherwise is false.

HBase supports replication between clusters (i.e. data centers). I believe, but admit I'm not super familiar with the Cassandra option here, that the main difference is that HBase provides a simple mechanism and the user must build a replication architecture useful for them, while Cassandra attempts to hide some of that complexity. I do not know if they succeed there, but large scale cross data center replication is rarely one size fits all, so I doubt it.

Cassandra does not have strong consistency in the sense that HBase provides. It can provide strong consistency, but at the cost of failing any read if there is insufficient quorum. HBase/HDFS does not have that limitation. On the other hand, HBase has its own and different scenarios where data may not be immediately available. The differences between the systems are nuanced, and which to use depends on the use case requirements.

Cassandra's RandomPartitioner / hash based partitioning means efficient MapReduce or table scanning is not possible, whereas HBase's distributed ordered tree is naturally efficient for such use cases, which I believe explains why Hadoop users often prefer it.
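
To make the partitioning point concrete, here is a minimal Python sketch, not code from either project. It assumes a toy cluster of four nodes and made-up row keys; `ordered_node`, `hashed_node`, and the split points are hypothetical names of mine, and the modulo placement is a simplification of how a hash partitioner assigns tokens. It only counts how many nodes a short range scan has to touch under each placement scheme.

```python
import bisect
import hashlib

NUM_NODES = 4
# Made-up row keys, already in sorted order.
keys = [f"user{i:04d}" for i in range(1000)]

# Ordered placement: each node owns one contiguous key range.
# Hypothetical split points, loosely analogous to region boundaries.
SPLITS = ["user0250", "user0500", "user0750"]


def ordered_node(key):
    """Return the node that owns the sorted range containing key."""
    return bisect.bisect_right(SPLITS, key)


def hashed_node(key):
    """Return a node chosen from the key's MD5 hash (simplified to a modulo)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES


def nodes_for_scan(start, stop, placement):
    """Return the set of nodes holding at least one key in [start, stop)."""
    return {placement(k) for k in keys if start <= k < stop}


ordered_hits = nodes_for_scan("user0100", "user0150", ordered_node)
hashed_hits = nodes_for_scan("user0100", "user0150", hashed_node)

print("ordered placement touches nodes:", sorted(ordered_hits))
print("hashed placement touches nodes:", sorted(hashed_hits))
```

With ordered placement the 50-key scan stays on the single node that owns that range, while hashing scatters neighboring keys across the cluster, so the same scan has to consult several nodes.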
The partitioning difference may or may not be a problem for any given use case. Using an ordered partitioner with Cassandra used to require frequent manual rebalancing to avoid blowing up nodes. I don't know if more recent versions still have this misfeature.

Cassandra is no less complex than HBase. Its complexity is merely "hidden": with Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the Cassandra internals are no less layered. An impartial analysis of implementation and algorithms will reveal that Cassandra's theory of operation in its full detail is substantially more complex. Compare the BigTable and Dynamo papers and this is clear. There are actually more opportunities for something to go wrong with Cassandra.

While we are looking at codebases, it should be noted that HBase has substantially more unit tests.

With Cassandra, all RPC is via Thrift with various wrappers, so actually all Cassandra clients are second class in the sense that jbellis means when he states "Non-Java clients are not second-class citizens".

The master-slave versus peer-to-peer argument is larger than Cassandra vs. HBase, and not nearly as one-sided as claimed. The famous (infamous?) global failure of Amazon's S3 in 2008, a fully peer-to-peer system, due to a single flipped bit in a gossip message demonstrates how in peer-to-peer systems every node can be a single point of failure. There is no obvious winner; instead, there is a series of trade-offs. Claiming otherwise is intellectually dishonest. Master-slave architectures seem easier to operate and reason about in my experience. Of course, I'm partial there.

I have just scratched the surface.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back.
- Piet Hein (via Tom White)

>________________________________
>From: Chris Tarnas <c...@email.com>
>To: hbase-u...@hadoop.apache.org
>Sent: Tuesday, August 30, 2011 2:02 PM
>Subject: HBase and Cassandra on StackOverflow
>
>Someone with better knowledge than might be interested in helping answer this
>question over at StackOverflow:
>
>http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
>
>-chris