Dear Ian,

Thank you so much for your detailed reply! I will read the book about HBase.
Best regards,
Bing

On Mon, Jan 30, 2012 at 2:36 PM, Ian Varley <ivar...@salesforce.com> wrote:
> Bing,
>
> HBase uses an approach to structuring its storage known as "Log Structured
> Merge Trees", which you can learn more about here:
>
> http://scholar.google.com/scholar?q=log+structured+merge+tree&hl=en&as_sdt=0&as_vis=1&oi=scholart
>
> As well as in Lars George's great book, here:
>
> http://shop.oreilly.com/product/0636920014348.do
>
> It does all of these "frequent updates" just in memory, which is very
> fast; at the same time, it writes a simple forward-only log of all edits
> (known as the Write Ahead Log, or WAL) to disk in order to provide
> durability in the event of machine failure. It periodically writes the
> in-memory data to disk in big immutable ordered chunks, called "store
> files", which is very efficient. Future reads of the data then "merge" the
> on-disk store file data with the current state in memory, to get the full
> picture of the state of any row. Over time, the many small store files get
> "compacted" into bigger files, so that individual reads don't have too many
> files to read from. Each "get" or "scan" operation can just read small
> blocks of the store files; when you ask for one record, it doesn't have to
> read gigabytes of data from the disk, it can just read a small block. As
> such, random small reads and writes on a very big data set can be done
> efficiently.
>
> Furthermore, it's fine to update the data store frequently. For any given
> record, you can make as many updates as you want to the in-memory
> structures, and these aren't written to disk until the memory store is
> flushed (and into the WAL, but that's also efficient b/c it's ordered by
> update time, not record key). It all happens in memory, which is very fast
> (but, again, it's safe b/c of the WAL). There are even some recent JIRAs
> that make that process more efficient, for example HBASE-4241
> <https://issues.apache.org/jira/browse/HBASE-4241>.
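[The write path Ian describes (updates land in memory and the WAL, get flushed
to immutable ordered store files, and reads merge memory with the files) can be
sketched as a toy model. This is NOT HBase code; the `MiniLSM` class and all
its names are invented purely to illustrate the Log Structured Merge idea.]

```python
# Toy sketch of a Log Structured Merge store. Invented for illustration;
# this is not HBase's implementation.

class MiniLSM:
    def __init__(self):
        self.wal = []          # forward-only edit log, for durability
        self.memstore = {}     # in-memory updates: key -> value
        self.store_files = []  # immutable, key-ordered "store files" (newest last)

    def put(self, key, value):
        self.wal.append((key, value))  # WAL is ordered by update time, not key
        self.memstore[key] = value     # the update itself touches only memory

    def flush(self):
        # Periodically write the memstore out as one immutable, sorted chunk.
        if self.memstore:
            self.store_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Merge-on-read: check memory first, then store files newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for sf in reversed(self.store_files):
            if key in sf:
                return sf[key]
        return None

    def compact(self):
        # Merge many small store files into one, so reads touch fewer files.
        merged = {}
        for sf in self.store_files:  # oldest first, so newer values win
            merged.update(sf)
        self.store_files = [dict(sorted(merged.items()))]

db = MiniLSM()
db.put("row1", "v1")
db.put("row1", "v2")   # frequent updates to one row hit only memory + WAL
db.flush()
db.put("row1", "v3")
print(db.get("row1"))  # v3: the memstore shadows older store-file data
db.flush()
db.compact()
print(db.get("row1"))  # still v3 after compaction
```

[Note how repeated puts never rewrite anything on "disk"; they only append to
the WAL and overwrite a memory entry, which is why frequent updates are cheap.]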
>
> One way to think about it is that HBase is *precisely* a layer that adds
> these efficient random read/write capabilities on top of the Hadoop
> distributed file system (HDFS), and takes care of doing that in a way that
> parallelizes nicely across a large cluster of machines, deals with machine
> failures, etc.
>
> Ian
>
> On Jan 29, 2012, at 10:16 PM, Bing Li wrote:
>
> Dear Stack,
>
> Thanks so much for your reply!
>
> As I understand it, a large-scale distributed system prefers
> write-once-read-many. Frequent updating must bring a heavy load for
> consistency, and performance must suffer. So HBase must not be suitable
> for frequent updates, right?
>
> Best regards,
> Bing
>
> On Mon, Jan 30, 2012 at 1:51 PM, Stack <st...@duboce.net<mailto:
> st...@duboce.net>> wrote:
>
> On Sun, Jan 29, 2012 at 12:02 PM, Bing Li <lbl...@gmail.com<mailto:
> lbl...@gmail.com>> wrote:
> Another question is whether it is proper to update data in HBase
> frequently?
>
> This is 'normal', yes.
> St.Ack