Very long time between node failure and reassigning of regions.

Michał Podsiadłowski Mon, 26 Apr 2010 09:14:27 -0700

Hi hbase users,

during our tests in a production environment we found a few really big
problems that stop us from using HBase. The first major problem is
availability: we now have 6 region servers + 2 masters + 3 ZK nodes.
When we shut down one region server normally, it takes about 3-4
minutes, or longer depending on the previous load, until the master
reassigns the missing regions to the live region servers. Each region
server usually holds fewer than 100 regions. In the master logs we can
see some log splitting, then a long break, and then the start of
reassignment, which can also take a long time, especially when the
cluster is under load. This is far longer than we can wait, because
during that time requests to the website are not processed.
An additional, very unfortunate situation happened when my friend shut
down 3 out of 6 nodes: the master started to do its job, but something
went terribly wrong and it started to throw NPEs like mad.
Here is the beginning of the disaster: http://pastebin.com/1uh1x1fL
After we killed this server, the second master picked up and managed
to start, but with only 91 out of 306 regions, and only after a long
time.
Another big problem is that table connections in some circumstances
hang with no error thrown. The web server's request-processing thread
pool quickly runs out of threads, no requests are processed, and the
watchdog kills the server.
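
To make the hang scenario concrete, here is a minimal sketch of the kind
of guard a web tier could put around a blocking storage call, so that a
request thread waits only a bounded time. It uses only java.util.concurrent
and no HBase API; the class name BoundedCall, the method callWithTimeout,
and the pool size of 16 are hypothetical choices for illustration, not
anything from our setup.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a blocking call on a separate pool and gives up
// after a timeout, so the calling (web request) thread is never stuck forever.
final class BoundedCall {
    // Separate pool for storage calls; the size 16 is an arbitrary assumption.
    // Daemon threads so that a hung call cannot block JVM shutdown.
    private static final ExecutorService POOL =
        Executors.newFixedThreadPool(16, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            // Wait at most timeoutMs for the result.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only interrupts it, so a call
            // that ignores interrupts keeps occupying its pool thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The call itself failed with an exception.
            return fallback;
        }
    }
}
```

This only moves the problem: if every thread in the separate pool is
stuck, new calls queue up and time out as well. The request threads do
at least get a fast failure instead of hanging until the watchdog acts.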



For those who want more reading: http://pastebin.com/UaEPT6nc is the
master log from the beginning of the test,
and the second master log is at http://pastebin.com/shpcDWBn


Any help appreciated.
Thanks, Michal
