Hey Baggio,

Looks like you've done some good analysis. Much of what you've mentioned
under HBase is in the works (multi-threaded compactions, distributed log
splitting, the HBCK tool).
I would definitely recommend upgrading to 0.90 when it is released; there
are some good fixes related to exception handling and DFS errors. The
corresponding HDFS releases (CDH3 or 0.20-append) provide true durability.

Thanks for sharing!

JG

> -----Original Message-----
> From: baggio liu [mailto:baggi...@gmail.com]
> Sent: Monday, December 13, 2010 8:45 AM
> To: user@hbase.apache.org
> Subject: Re: HBase stability
>
> Hi Anze,
>
> Our production cluster runs HBase 0.20.6 on HDFS (CDH3b2), and we have
> worked on stability for about a month. Here are some issues we have met
> that may be helpful to you.
>
> HDFS:
> 1. HBase files have a shorter life cycle than map-reduce files; at times
>    there are many blocks to delete, so the speed of HDFS invalid-block
>    deletion should be tuned.
> 2. The Hadoop 0.20 branch cannot handle disk failure; HDFS-630 will be
>    helpful.
> 3. The region server does not handle IOException correctly. When the
>    DFSClient hits a network error it throws an IOException, which may
>    not be fatal for the region server, so these IOExceptions should be
>    reviewed.
> 4. A large-scale scan creates many concurrent readers in a short time.
>    The datanode dataxceiver count must be raised to a large number, and
>    the file-handle limit should be tuned. In addition, connections
>    between the DFSClient and the datanode should be reused.
>
> HBase:
> 1. Single-threaded compaction limits compaction speed; it should be made
>    multi-threaded. (With multi-threaded compaction, the network
>    bandwidth used by compaction should be limited.)
> 2. Single-threaded HLog splitting (reading the HLog) makes HBase
>    downtime longer; making it multi-threaded can limit HBase downtime.
> 3. Additionally, some tools should be built, such as a meta-region
>    checker, a fixer, and so on.
> 4. The ZooKeeper session timeout should be tuned according to the load
>    on your HBase cluster.
> 5. The GC strategy on your region servers/HMaster should be tuned.
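For anyone following along: the dataxceiver and session-timeout knobs
mentioned above map to the config properties below. The values shown are
an illustrative sketch, not numbers from this thread; tune them for your
own cluster.

```xml
<!-- hdfs-site.xml: raise the datanode transceiver cap for scan-heavy
     HBase load. Note the property name's historical misspelling
     ("xcievers"); the 0.20-era default of 256 is far too low for HBase. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

<!-- hbase-site.xml: ZooKeeper session timeout in milliseconds. Too short
     and a long GC pause expires the region server's session; too long and
     failure detection is slow. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
```

The file-handle limit is raised outside Hadoop, e.g. with a `nofile`
entry in /etc/security/limits.conf for the user running the datanode and
region server.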
> Besides the above, in a production cluster the data-loss issue should be
> fixed as well (currently the Hadoop 0.20-append branch and CDH3b2 Hadoop
> can be used). Because HDFS makes many optimizations for throughput, much
> tuning and change on HDFS is needed for an application like HBase (many
> random reads/writes).
>
> Hope this experience is helpful to you.
>
> Thanks & best regards,
> Baggio
>
> 2010/12/14 Todd Lipcon <t...@cloudera.com>
>
> > Hi Anze,
> >
> > In a word, yes - 0.20.4 is not that stable in my experience, and
> > upgrading to the latest CDH3 beta (which includes HBase 0.89.20100924)
> > should give you a huge improvement in stability.
> >
> > You'll still need to do a bit of tuning of settings, but once it's
> > well tuned it should be able to hold up under load without crashing.
> >
> > -Todd
> >
> > On Mon, Dec 13, 2010 at 2:41 AM, Anze <anzen...@volja.net> wrote:
> > > Hi all!
> > >
> > > We have been using HBase 0.20.4 (cdh3b1) in production on 2 nodes
> > > for a few months now and we are having constant issues with it. We
> > > fell into all the standard traps (like "Too many open files",
> > > network configuration problems, ...). All in all, we had about one
> > > crash every week or so. Fortunately we are still using it just for
> > > background processing, so our service didn't suffer directly, but we
> > > have lost huge amounts of time just fixing the data errors that
> > > resulted from data not being written to permanent storage. Not to
> > > mention fixing the issues themselves.
> > > As you can probably understand, we are very frustrated with this and
> > > are seriously considering moving to another bigtable implementation.
> > >
> > > Right now, HBase crashes whenever we run a very intensive rebuild of
> > > a secondary index (a normal table, but we use it as a secondary
> > > index) on a huge table.
> > > I have found this:
> > > http://wiki.apache.org/hadoop/Hbase/Troubleshooting
> > > (see problem 9)
> > > One of the lines reads:
> > > "Make sure you give plenty of RAM (in hbase-env.sh), the default of
> > > 1GB won't be able to sustain long running imports."
> > >
> > > So, if I understand correctly, no matter how HBase is set up, if I
> > > run an intensive enough application, it will choke? I would expect
> > > it to be slower when under (too much) pressure, but not to crash.
> > >
> > > Of course, we will somehow solve this issue (working on it), but...
> > > :(
> > >
> > > What are your experiences with HBase? Is it stable? Is it just us
> > > and the way we set it up?
> > >
> > > Also, would upgrading to 0.89 (cdh3b3) help?
> > >
> > > Thanks,
> > >
> > > Anze
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
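For the archive: the hbase-env.sh RAM setting quoted from the
troubleshooting wiki above looks like the following. The values are an
illustrative sketch, not a recommendation from this thread; size the heap
to your own hardware.

```
# hbase-env.sh -- illustrative values only.
# HBASE_HEAPSIZE is in MB; the default of 1000 is the 1GB the wiki warns
# about for long-running imports.
export HBASE_HEAPSIZE=4000
# A CMS-based GC strategy helps keep pauses below the ZooKeeper session
# timeout, so long collections don't get the region server declared dead.
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
```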