For more architectural details of HBase, check out the Bigtable paper. It's fairly detailed, short, and accessible.
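As a rough illustration of the write path discussed in the quoted messages (log first, then the in-memory store, then a flush to files), here is a toy Java model. It is only a sketch and not HBase code: `WritePathSketch`, `RegionServer`, `crashAndRecover`, and the other names are made up for the example, apart from the `writeToWAL` flag, which mirrors the field quoted from the `Put` class. It also assumes the log itself is durable, which, as noted in the thread, HDFS did not fully guarantee at the time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path described in this thread; not HBase code. */
public class WritePathSketch {

    /** Hypothetical stand-in for a region server with a single store. */
    static final class RegionServer {
        // Stand-in for the write-ahead log (HLog); ASSUMED durable here.
        private final List<String[]> wal = new ArrayList<>();
        // Stand-in for the in-memory cache; lost when the server crashes.
        private final TreeMap<String, String> memstore = new TreeMap<>();
        // Stand-in for HFiles replicated by HDFS; ASSUMED durable here.
        private final TreeMap<String, String> flushed = new TreeMap<>();

        /** Log the edit first (if requested), then apply it in memory. */
        void put(String row, String value, boolean writeToWAL) {
            if (writeToWAL) {
                wal.add(new String[] {row, value});
            }
            memstore.put(row, value);
        }

        /** Move in-memory edits to "disk"; simplification: the log is then emptied. */
        void flush() {
            flushed.putAll(memstore);
            memstore.clear();
            wal.clear();
        }

        /** Simulate a crash: memory is lost, then the log is replayed. */
        void crashAndRecover() {
            memstore.clear();
            for (String[] edit : wal) {
                memstore.put(edit[0], edit[1]);
            }
        }

        /** Read from memory first, then from flushed data. */
        String get(String row) {
            String value = memstore.get(row);
            return value != null ? value : flushed.get(row);
        }
    }

    public static void main(String[] args) {
        RegionServer rs = new RegionServer();
        rs.put("row1", "logged", true);
        rs.put("row2", "unlogged", false);
        rs.crashAndRecover();
        System.out.println(rs.get("row1"));
        System.out.println(rs.get("row2"));
    }
}
```

Running `main` prints `logged` and then `null`: the edit that went through the log survives the simulated crash, while the one that skipped it is lost.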
On Sat, May 8, 2010 at 2:39 PM, Amandeep Khurana <ama...@gmail.com> wrote:

> HBase does not do in-memory replication. Your data goes into a region, which
> has only one instance. Writes go to the write-ahead log first, which is
> written to disk. However, since HDFS doesn't yet have a fully working
> flush functionality, there is a chance of losing a chunk of data. The next
> release of HBase will guarantee data durability, since by then the flush
> functionality will be fully working.
>
> Regarding replication - the difference between Cassandra and HBase is that
> when you do a write in Cassandra, it doesn't return until it has written to
> W nodes, where W is configurable. In the case of HBase, replication is taken
> care of by the filesystem (HDFS). When the region is flushed to disk,
> HDFS replicates the HFiles (in which the data for the regions is stored).
> For more details on how this works, read the Bigtable paper and
> http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.
>
>
> 2010/5/8 MauMau <maumau...@gmail.com>
>
>> Hello,
>>
>> I'm comparing HBase and Cassandra, which I think are the most promising
>> distributed key-value stores, to determine which one to choose for
>> future OLTP and data analysis.
>> I found the following benchmark report by Yahoo! Research which evaluates
>> HBase, Cassandra, PNUTS, and sharded MySQL.
>>
>> http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
>> http://www.brianfrankcooper.net/pubs/ycsb.pdf
>>
>> The above report refers to HBase 0.20.3.
>> Reading this and HBase's documentation, two questions about load balancing
>> and replication have arisen. Could anyone give me any information to help
>> answer these questions?
>>
>> [Q2] replication
>> Does HBase perform in-memory replication of rows like Cassandra?
>> Does HBase sync updates to disk before returning success to clients?
>>
>> According to the following paragraph in the HBase design overview, HBase syncs
>> writes.
>>
>> ----------------------------------------
>> Write Requests
>> When a write request is received, it is first written to a write-ahead log
>> called a HLog. All write requests for every region the region server is
>> serving are written to the same HLog. Once the request has been written to
>> the HLog, the result of changes is stored in an in-memory cache called the
>> Memcache. There is one Memcache for each Store.
>> ----------------------------------------
>>
>> The source code of the Put class appears to show the above (though I don't
>> understand the server-side code yet):
>>
>> private boolean writeToWAL = true;
>>
>> However, Yahoo's report says the following. Is this incorrect? What is
>> in-memory replication? I know HBase relies on HDFS to replicate data in
>> storage, but not in memory.
>>
>> ----------------------------------------
>> For Cassandra, sharded MySQL and PNUTS, all updates were
>> synched to disk before returning to the client. HBase does
>> not sync to disk, but relies on in-memory replication across
>> multiple servers for durability; this increases write throughput
>> and reduces latency, but can result in data loss on failure.
>> ----------------------------------------
>>
>> Maumau