> Is the replication strategy for HBase completely reliant on HDFS' block
> replication pipelining ?

Yes.

> Is this replication process asynchronous ?

No.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>________________________________
>From: Sam Seigal <selek...@yahoo.com>
>To: user@hbase.apache.org; Andrew Purtell <apurt...@apache.org>
>Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org>
>Sent: Tuesday, August 30, 2011 7:35 PM
>Subject: Re: HBase and Cassandra on StackOverflow
>
>A question inline:
>
>On Tue, Aug 30, 2011 at 2:47 AM, Andrew Purtell <apurt...@apache.org> wrote:
>
>> Hi Chris,
>>
>> Appreciate your answer on the post.
>>
>> Personally speaking however the endless Cassandra vs. HBase discussion is
>> tiresome and rarely do blog posts or emails in this regard shed any light.
>> Often, Cassandra proponents mis-state their case out of ignorance of HBase
>> or due to commercial or personal agendas. It is difficult to find clear eyed
>> analysis among the partisans. I'm not sure it will make any difference
>> posting a rebuttal to some random thing jbellis says. Better to focus on
>> improving HBase than play whack a mole.
>>
>>
>> Regarding some of the specific points in that post:
>>
>> HBase is proven in production deployments larger than the largest publicly
>> reported Cassandra cluster, ~1K versus 400 or 700 or somesuch. But basically
>> this is the same order of magnitude, with HBase having a slight edge. I
>> don't see a meaningful difference here. Stating otherwise is false.
>>
>> HBase supports replication between clusters (i.e. data centers). I believe,
>> but admit I'm not super familiar with the Cassandra option here, that the
>> main difference is HBase provides simple mechanism and the user must build a
>> replication architecture useful for them; while Cassandra attempts to hide
>> some of that complexity. I do not know if they succeed there, but large
>> scale cross data center replication is rarely one size fits all so I doubt
>> it.
>>
>> Cassandra does not have strong consistency in the sense that HBase
>> provides. It can provide strong consistency, but at the cost of failing any
>> read if there is insufficient quorum. HBase/HDFS does not have that
>> limitation. On the other hand, HBase has its own and different scenarios
>> where data may not be immediately available. The differences between the
>> systems are nuanced and which to use depends on the use case requirements.
>>
>>
>I have a question regarding this point. Is the replication strategy for
>HBase completely reliant on HDFS' block replication pipelining ? Is this
>replication process asynchronous ? If it is, then is there not a window,
>where when a machine is to die and the replication pipeline for a particular
>block has not started yet, that block will be unavailable until the machine
>comes back up ? Sorry, if I am missing something important here.
>
>
>> Cassandra's RandomPartitioner / hash based partitioning means efficient
>> MapReduce or table scanning is not possible, whereas HBase's distributed
>> ordered tree is naturally efficient for such use cases, I believe explaining
>> why Hadoop users often prefer it. This may or may not be a problem for any
>> given use case. Using an ordered partitioner with Cassandra used to require
>> frequent manual rebalancing to avoid blowing up nodes. I don't know if more
>> recent versions still have this mis-feature.
>>
>> Cassandra is no less complex than HBase. All of this complexity is "hidden"
>> in the sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase,
>> etc. -- but the Cassandra internals are no less layered. An impartial
>> analysis of implementation and algorithms will reveal that Cassandra's
>> theory of operation in its full detail is substantially more complex.
>> Compare the BigTable and Dynamo papers and this is clear. There are actually
>> more opportunities for something to go wrong with Cassandra.
>>
>> While we are looking at codebases, it should be noted that HBase has
>> substantially more unit tests.
>>
>> With Cassandra, all RPC is via Thrift with various wrappers, so actually
>> all Cassandra clients are second class in the sense that jbellis means when
>> he states "Non-Java clients are not second-class citizens".
>>
>> The master-slave versus peer-to-peer argument is larger than Cassandra vs.
>> HBase, and not nearly as one sided as claimed. The famous (infamous?) global
>> failure of Amazon's S3 in 2008, a fully peer-to-peer system, due to a single
>> flipped bit in a gossip message demonstrates how in peer to peer systems
>> every node can be a single point of failure. There is no obvious winner,
>> instead, a series of trade offs. Claiming otherwise is intellectually
>> dishonest. Master-slave architectures seem easier to operate and reason
>> about in my experience. Of course, I'm partial there.
>>
>> I have just scratched the surface.
>>
>>
>> Best regards,
>>
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>>
>> >________________________________
>> >From: Chris Tarnas <c...@email.com>
>> >To: hbase-u...@hadoop.apache.org
>> >Sent: Tuesday, August 30, 2011 2:02 PM
>> >Subject: HBase and Cassandra on StackOverflow
>> >
>> >Someone with better knowledge than might be interested in helping answer
>> >this question over at StackOverflow:
>> >
>> >
>> >http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
>> >
>> >-chris
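
The "Yes" / "No" reply at the top of this message can be pictured with a toy model. What follows is only a minimal sketch of a synchronous replication pipeline, not the HDFS API: the names DataNode, store, and pipeline_write are hypothetical, the three-node pipeline is an assumption, and the failure handling that real HDFS performs is not modelled. It is meant to show one thing: the writer gets an acknowledgement only after every node in the pipeline holds the data, so there is no window in which an acknowledged block exists on a single machine.

```python
# Toy model of a synchronous replication pipeline (NOT the HDFS API).
# All class and function names here are hypothetical, for illustration only.

class DataNode:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.blocks = {}  # block_id -> bytes held by this node

    def store(self, block_id, data, downstream):
        """Store locally, forward to the rest of the pipeline, and return
        True only if this node and every downstream node acknowledged."""
        if not self.alive:
            return False
        self.blocks[block_id] = data
        if downstream:
            # The ack from downstream must come back before we ack upstream.
            return downstream[0].store(block_id, data, downstream[1:])
        return True


def pipeline_write(block_id, data, pipeline):
    """The writer sees success only after the ack has travelled back
    through every node in the pipeline (synchronous replication)."""
    return pipeline[0].store(block_id, data, pipeline[1:])


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    acked = pipeline_write("blk_1", b"row data", nodes)
    replicas = sum("blk_1" in n.blocks for n in nodes)
    print(acked, replicas)  # True 3
```

In this sketch a dead node simply makes the write fail; the point is only that an acknowledged write already has all of its replicas, which is why losing one machine afterwards does not make the block unavailable.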