Hi Chris,

Appreciate your answer on the post.

Personally speaking, however, the endless Cassandra vs. HBase discussion is 
tiresome, and blog posts or emails on the subject rarely shed any light. 
Often, Cassandra proponents misstate their case out of ignorance of HBase or 
due to commercial or personal agendas. It is difficult to find clear-eyed 
analysis among the partisans. I'm not sure it will make any difference to post 
a rebuttal to some random thing jbellis says. Better to focus on improving 
HBase than to play whack-a-mole.


Regarding some of the specific points in that post:

HBase is proven in production deployments larger than the largest publicly 
reported Cassandra cluster: roughly 1K nodes versus 400 or 700 or some such. 
But basically this is the same order of magnitude, with HBase having a slight 
edge. I don't see a meaningful difference here. Stating otherwise is false.

HBase supports replication between clusters (i.e. data centers). I believe, but 
admit I'm not super familiar with the Cassandra option here, that the main 
difference is that HBase provides a simple mechanism and the user must build a 
replication architecture that works for them, while Cassandra attempts to hide 
some of that complexity. I do not know if they succeed there, but large scale 
cross data center replication is rarely one-size-fits-all, so I doubt it.

Cassandra does not have strong consistency in the sense that HBase provides. It 
can provide strong consistency, but at the cost of failing any read if there is 
insufficient quorum. HBase/HDFS does not have that limitation. On the other 
hand, HBase has its own and different scenarios where data may not be 
immediately available. The differences between the systems are nuanced and 
which to use depends on the use case requirements.
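
To make the quorum point concrete, here is a minimal sketch in Python. It is a 
toy model, not Cassandra's code or API: the names quorum_read and 
QuorumUnavailable are made up for illustration, and a replica is just a 
(timestamp, value) pair, or None if it cannot be reached. The usual condition 
for quorum reads to see the latest write is R + W > N.

```python
class QuorumUnavailable(Exception):
    """Raised when fewer replicas answer than the read quorum requires."""


def quorum_read(replicas, read_quorum):
    """Toy quorum read (hypothetical helper, not a real client call).

    replicas: one entry per replica, either a (timestamp, value) tuple
              or None if that replica is unreachable.
    """
    live = [r for r in replicas if r is not None]
    if len(live) < read_quorum:
        # This is the cost described above: the read fails outright.
        raise QuorumUnavailable(
            f"{len(live)} live replicas, need {read_quorum}"
        )
    # Consult read_quorum replicas and return the newest value among them.
    consulted = live[:read_quorum]
    return max(consulted, key=lambda r: r[0])[1]


# N = 3 replicas, W = 2, R = 2, so R + W > N.
one_down = [(2, "new"), (1, "old"), None]
two_down = [(2, "new"), None, None]

print(quorum_read(one_down, read_quorum=2))
try:
    quorum_read(two_down, read_quorum=2)
except QuorumUnavailable as err:
    print("read failed:", err)
```

With one replica down the read still succeeds; with two down it fails even 
though one copy of the data is still reachable.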

Cassandra's RandomPartitioner / hash based partitioning means efficient 
MapReduce or table scanning is not possible, whereas HBase's distributed 
ordered tree is naturally efficient for such use cases, which I believe 
explains why Hadoop users often prefer it. This may or may not be a problem 
for any given use case. Using an ordered partitioner with Cassandra used to 
require frequent manual rebalancing to avoid blowing up nodes. I don't know if 
more recent versions still have this misfeature.
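
A rough illustration of why key ordering matters for scans, again as a toy 
Python sketch. The helper names, the four-node layout, and the split points 
are all assumptions for the example; neither system is implemented this way. 
With hash placement, adjacent row keys land on unrelated nodes; with ordered 
placement, a key range maps to a contiguous run of regions.

```python
import bisect
import hashlib


def hash_node(key, num_nodes):
    """Hash placement: the node depends only on a digest of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def ordered_node(key, split_points):
    """Ordered placement: split_points are the sorted region boundaries."""
    return bisect.bisect_right(split_points, key)


def nodes_for_range_ordered(start, stop, split_points):
    """Regions that hold keys in [start, stop], both ends inclusive."""
    first = ordered_node(start, split_points)
    last = ordered_node(stop, split_points)
    return list(range(first, last + 1))


keys = [f"user{i}" for i in range(100, 200)]
splits = ["g", "n", "t"]  # four regions, numbered 0 to 3

# Hash placement: a scan over these keys has to visit every node it hits.
print(sorted({hash_node(k, 4) for k in keys}))

# Ordered placement: the same key range sits in one region.
print(nodes_for_range_ordered("user100", "user199", splits))
```

The flip side is the rebalancing problem mentioned above: with ordered 
placement, a burst of writes to neighbouring keys all lands on the same region.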

Cassandra is no less complex than HBase. Its complexity is merely "hidden": 
with Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the 
Cassandra internals are no less layered. An impartial analysis of 
implementation and algorithms will reveal that Cassandra's theory of operation 
in its full detail is substantially more complex. Compare the BigTable and 
Dynamo papers and this is clear. There are actually more opportunities for 
something to go wrong with Cassandra.

While we are looking at codebases, it should be noted that HBase has 
substantially more unit tests.

With Cassandra, all RPC is via Thrift with various wrappers, so actually all 
Cassandra clients are second class in the sense that jbellis means when he 
states "Non-Java clients are not second-class citizens".

The master-slave versus peer-to-peer argument is larger than Cassandra vs. 
HBase, and not nearly as one-sided as claimed. The famous (infamous?) global 
failure of Amazon's S3 in 2008, a fully peer-to-peer system, due to a single 
flipped bit in a gossip message demonstrates how in peer-to-peer systems every 
node can be a single point of failure. There is no obvious winner; instead, 
there is a series of trade-offs. Claiming otherwise is intellectually 
dishonest. Master-slave architectures seem easier to operate and reason about 
in my experience. Of course, I'm partial there.

I have just scratched the surface.


Best regards,


       - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


>________________________________
>From: Chris Tarnas <c...@email.com>
>To: hbase-u...@hadoop.apache.org
>Sent: Tuesday, August 30, 2011 2:02 PM
>Subject: HBase and Cassandra on StackOverflow
>
>Someone with better knowledge than me might be interested in helping answer 
>this question over at StackOverflow:
>
>http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
>
>-chris
>
>
