Hi hbase users,

During our tests on the production environment we found a few really big problems that stop us from using HBase.

The first major problem is availability. We now have 6 region servers + 2 masters + 3 zk. When we gracefully shut down one region server, it takes about 3-4 minutes or longer, depending on the previous load, until the master reassigns the missing regions to the live region servers. Each region server usually holds fewer than 100 regions. In the master logs we can see some log splitting, then a long pause, and then the start of reassignment, which can also take a long time, especially when the cluster is under load. This is far longer than we can wait, because during that time requests to the website are not processed.

An additional, very unfortunate situation happened when my friend shut down 3 out of 6 nodes: the master started to do the job, but something went terribly wrong and it started to throw NPEs like mad. Here is the beginning of the disaster: http://pastebin.com/1uh1x1fL After we killed that master, the second one took over and managed to start, but only after a long time and with only 91 out of 306 regions.

Another big problem is that table connections hang in some circumstances, with no error thrown. The web server's request-processing thread pool quickly runs out of threads, no requests are processed, and the watchdog kills the server.
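
To make the hang scenario concrete: a guard along the following lines on the web side would at least keep request threads from blocking forever. This is only a minimal sketch under assumptions. `BoundedCall`, `callWithTimeout`, and the sleeping `hangingRead` task are hypothetical stand-ins for a table call, not HBase API, and the sketch does not address the underlying hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCall {

    // Runs the task on a separate pool and waits at most timeoutMs for the result.
    // On timeout the task is cancelled (its worker thread is interrupted) and the
    // TimeoutException is rethrown so the caller can fail the request quickly.
    public static <T> T callWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Small dedicated pool, so stuck calls cannot consume the web server's threads.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Hypothetical stand-in for a table read that never returns.
            Callable<String> hangingRead = () -> {
                Thread.sleep(60_000);
                return "row";
            };
            try {
                callWithTimeout(pool, hangingRead, 200);
            } catch (TimeoutException e) {
                System.out.println("read timed out, returning an error to the client");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

One caveat with this approach: if the blocked call ignores interrupts, its worker thread stays stuck, so the dedicated pool can itself fill up. It only limits the damage to that pool instead of the request-processing threads.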

For those who want more reading: http://pastebin.com/UaEPT6nc is the master log from the beginning of the test, and http://pastebin.com/shpcDWBn is the second master's log.

Any help appreciated.

Thanks,
Michal