Re: HBase and Cassandra on StackOverflow

Andrew Purtell Tue, 30 Aug 2011 09:17:32 -0700

> Is the replication strategy for HBase completely reliant on HDFS' block
> replication pipelining?


Yes.

> Is this replication process asynchronous?


No.
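
To illustrate what "not asynchronous" means here, below is a toy Python model
of a synchronous write pipeline. It is only a sketch, not HDFS code: the names
DataNode and pipeline_write are made up for illustration, and it leaves out the
pipeline recovery that real HDFS performs when a node in the pipeline fails.
The point it shows is that the writer gets its acknowledgement only after every
replica in the chain has received the data, so there is no window in which an
acknowledged write exists on a single machine only.

```python
class DataNode:
    """Hypothetical stand-in for one replica holder in a write pipeline."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.blocks = []

    def receive(self, packet, downstream):
        """Store the packet, forward it, and ack only after downstream acks."""
        if not self.alive:
            return False
        self.blocks.append(packet)
        if downstream:
            # The ack travels back up the chain, so this node cannot
            # report success before the nodes after it have done so.
            return downstream[0].receive(packet, downstream[1:])
        return True


def pipeline_write(packet, pipeline):
    """Return True only when every node in the pipeline holds the packet."""
    return pipeline[0].receive(packet, pipeline[1:])


if __name__ == "__main__":
    healthy = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    print(pipeline_write(b"edit-1", healthy))

    broken = [DataNode("dn1"), DataNode("dn2", alive=False), DataNode("dn3")]
    print(pipeline_write(b"edit-2", broken))
```

With three live nodes the write is acknowledged once all three hold the packet.
With a dead node in the chain the writer receives no acknowledgement, so it
knows the write did not complete.
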

Best regards,


       - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


>________________________________
>From: Sam Seigal <selek...@yahoo.com>
>To: user@hbase.apache.org; Andrew Purtell <apurt...@apache.org>
>Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org>
>Sent: Tuesday, August 30, 2011 7:35 PM
>Subject: Re: HBase and Cassandra on StackOverflow
>
>A question inline:
>
>On Tue, Aug 30, 2011 at 2:47 AM, Andrew Purtell <apurt...@apache.org> wrote:
>
>> Hi Chris,
>>
>> Appreciate your answer on the post.
>>
>> Personally speaking, however, the endless Cassandra vs. HBase discussion is
>> tiresome, and blog posts or emails on the subject rarely shed any light.
>> Often, Cassandra proponents misstate their case out of ignorance of HBase
>> or due to commercial or personal agendas. It is difficult to find clear-eyed
>> analysis among the partisans. I'm not sure it will make any difference
>> posting a rebuttal to some random thing jbellis says. Better to focus on
>> improving HBase than play whack-a-mole.
>>
>>
>> Regarding some of the specific points in that post:
>>
>> HBase is proven in production deployments larger than the largest publicly
>> reported Cassandra cluster, ~1K versus 400 or 700 or somesuch. But basically
>> this is the same order of magnitude, with HBase having a slight edge. I
>> don't see a meaningful difference here. Stating otherwise is false.
>>
>> HBase supports replication between clusters (i.e. data centers). I believe,
>> but admit I'm not super familiar with the Cassandra option here, that the
>> main difference is HBase provides a simple mechanism and the user must build a
>> replication architecture useful for them; while Cassandra attempts to hide
>> some of that complexity. I do not know if they succeed there, but large
>> scale cross data center replication is rarely one size fits all so I doubt
>> it.
>>
>> Cassandra does not have strong consistency in the sense that HBase
>> provides. It can provide strong consistency, but at the cost of failing any
>> read if there is insufficient quorum. HBase/HDFS does not have that
>> limitation. On the other hand, HBase has its own and different scenarios
>> where data may not be immediately available. The differences between the
>> systems are nuanced and which to use depends on the use case requirements.
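
The quorum point above can be made concrete with the usual Dynamo-style
arithmetic: with N replicas, a read of R replicas is guaranteed to overlap a
write to W replicas when R + W > N. The Python sketch below is illustrative
only. The function names are hypothetical and are not part of any Cassandra
API; they just show why a quorum read must fail when too few replicas are
reachable.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlaps(n, r, w):
    """True if every read set of size r intersects every write set of size w."""
    return r + w > n


def read_succeeds(live_replicas, r):
    """A read at level r can only be served if at least r replicas respond."""
    return live_replicas >= r


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("quorum size:", q)
    print("quorum reads + quorum writes overlap:", overlaps(n, q, q))
    print("read with 2 live replicas:", read_succeeds(2, q))
    print("read with 1 live replica:", read_succeeds(1, q))
```

With three replicas the quorum is two, so losing two of the three replicas
makes a quorum read impossible even though one copy of the data is still
online. Dropping to R = 1 keeps the read available but gives up the overlap
guarantee.
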
>>
>>
>I have a question regarding this point. Is the replication strategy for
>HBase completely reliant on HDFS' block replication pipelining? Is this
>replication process asynchronous? If it is, then is there not a window
>where, if a machine dies before the replication pipeline for a particular
>block has started, that block will be unavailable until the machine comes
>back up? Sorry if I am missing something important here.
>
>
>> Cassandra's RandomPartitioner / hash-based partitioning means efficient
>> MapReduce or table scanning is not possible, whereas HBase's distributed
>> ordered tree is naturally efficient for such use cases, which I believe
>> explains why Hadoop users often prefer it. This may or may not be a problem
>> for any given use case. Using an ordered partitioner with Cassandra used to
>> require frequent manual rebalancing to avoid blowing up nodes. I don't know
>> if more recent versions still have this misfeature.
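
The scanning argument can be seen in a small Python sketch. It is a
simplification under stated assumptions: ordered_partition and hash_partition
are hypothetical helpers, the split points are invented, and taking an MD5
digest modulo the partition count only approximates how a hash partitioner
spreads keys (a real token ring assigns ranges of the hash space to nodes).
What it shows is that a contiguous key range stays together under ordered
partitioning and is scattered under hashing.

```python
import bisect
import hashlib


def ordered_partition(key, split_points):
    """Index of the sorted key range (region) that contains the key."""
    return bisect.bisect_right(split_points, key)


def hash_partition(key, num_partitions):
    """Partition chosen from a hash of the key, ignoring key order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


if __name__ == "__main__":
    # Illustrative split points giving four ordered partitions.
    splits = ["row-250", "row-500", "row-750"]
    # A range scan over 100 adjacent row keys.
    keys = ["row-%03d" % i for i in range(100, 200)]

    ordered = {ordered_partition(k, splits) for k in keys}
    hashed = {hash_partition(k, 4) for k in keys}
    print("partitions touched (ordered):", sorted(ordered))
    print("partitions touched (hashed):", sorted(hashed))
```

Under ordered partitioning the whole scan is served by one partition, which is
also why a hot key range can overload a single node and force rebalancing.
Under hashing the same scan has to contact many partitions.
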
>>
>> Cassandra is no less complex than HBase. Its complexity is merely "hidden":
>> with Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the
>> Cassandra internals are no less layered. An impartial
>> analysis of implementation and algorithms will reveal that Cassandra's
>> theory of operation in its full detail is substantially more complex.
>> Compare the BigTable and Dynamo papers and this is clear. There are actually
>> more opportunities for something to go wrong with Cassandra.
>>
>> While we are looking at codebases, it should be noted that HBase has
>> substantially more unit tests.
>>
>> With Cassandra, all RPC is via Thrift with various wrappers, so actually
>> all Cassandra clients are second class in the sense that jbellis means when
>> he states "Non-Java clients are not second-class citizens".
>>
>> The master-slave versus peer-to-peer argument is larger than Cassandra vs.
>> HBase, and not nearly as one-sided as claimed. The famous (infamous?) global
>> failure of Amazon's S3 in 2008, a fully peer-to-peer system, due to a single
>> flipped bit in a gossip message demonstrates how in peer-to-peer systems
>> every node can be a single point of failure. There is no obvious winner;
>> instead, there is a series of trade-offs. Claiming otherwise is
>> intellectually dishonest. Master-slave architectures seem easier to operate
>> and reason about in my experience. Of course, I'm partial there.
>>
>> I have just scratched the surface.
>>
>>
>> Best regards,
>>
>>
>>        - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>>
>> >________________________________
>> >From: Chris Tarnas <c...@email.com>
>> >To: hbase-u...@hadoop.apache.org
>> >Sent: Tuesday, August 30, 2011 2:02 PM
>> >Subject: HBase and Cassandra on StackOverflow
>> >
>> >Someone with better knowledge than me might be interested in helping answer
>> >this question over at StackOverflow:
>> >
>> >
>> >http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
>> >
>> >-chris
>> >
>> >
>>
>
>
>
