> Hi Michal,
>
> I agree that this use case should not cause the cluster to fail. By "just
> shutting down" do you mean you are running hbase-daemon.sh stop regionserver
> on 3 of the nodes? Are you doing all three at once or in quick succession?
> I'd like to try to reproduce your problem so we can get it fixed for 0.20.5.
>
> Thanks
> -Todd
Todd, James,

I'm in the office now, so I can tell you more about our cluster. The exact version of HBase is 0.20.4-dev, r926930. Hadoop underneath is 0.20.2+228 plus the patches from Cloudera:

hbase-patches-0b8ca9b5.tar.gz 22-Mar-2010 17:45 33K

As far as I can see, there is already something newer available.

We shut down the nodes one by one, moving from one shell to the other. We do it exactly as you wrote, but through the Cloudera wrapper scripts. So in theory everything should be stable and there shouldn't be any data loss. Practice shows otherwise...

In our environment we use HBase as an L2 web cache layer, with a memory-based layer in each application server. We have plans for more use cases, such as storing some user data there instead of in MySQL, but of course we need to be sure about its stability and availability, which for now are dubious to us. This means HBase must be available most of the time; otherwise we can't store invalidations for data changed by users' requests. If HBase is not available, we can still generate the page, which is sometimes very expensive, or serve it from the memory cache layer if it is there. POST requests can be rejected for a short time, but "short" does not mean several minutes. The worst scenario is when the HBase connections hang and the thread pool fills up, as I wrote earlier.

Regarding what James wrote about not touching anything while it is working: that is not an approach we want to take. First of all, if we stop all requests to HBase to do maintenance, we then have to clear all the data in it, because invalidations were not saved while it was down. The alternative is to stop the whole site, which is unacceptable from a marketing point of view, and the whole thing contradicts the HA idea anyway. One of the basic assumptions of Hadoop is to use commodity hardware, which is likely to fail once there are more than a few nodes. There is also a need to update the Hadoop and HBase software, since this is the bleeding edge of the technology and updates are released quite frequently. There are also JVM updates, kernel updates, etc., so the system must be resistant to the loss of one of the nodes. (We already managed to cause a disaster with just one node; see my earlier post to the group.)

Todd, if you need more logs, drop me an email. I can provide you with all the logs from HBase and Hadoop.

Thanks,
Michal.
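For the hang scenario described above, one common mitigation is to funnel every HBase call through a small dedicated executor with a hard deadline, so a dead or hung connection makes the request fail fast instead of filling the application server's thread pool. Below is a minimal sketch against the 0.20-era client API; the "invalidations" table, the "meta"/"invalidated_at" column, the 500 ms deadline, and the class name are all made up for illustration, not taken from the thread.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Sketch: write cache invalidations with a hard deadline so a hung
     *  HBase connection cannot pin the web server's request threads. */
    public class InvalidationWriter {

        // One dedicated worker: HTable in 0.20 is not thread-safe, and a
        // hang then blocks only this thread, never the servlet pool.
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private final HTable table;

        public InvalidationWriter() throws Exception {
            // 0.20-era constructor; later releases use HBaseConfiguration.create().
            table = new HTable(new HBaseConfiguration(), "invalidations"); // hypothetical table
        }

        /** Returns false instead of blocking when HBase is slow or down;
         *  the caller then regenerates the page or serves the in-memory copy. */
        public boolean invalidate(final String cacheKey) {
            Future<Void> f = worker.submit(new Callable<Void>() {
                public Void call() throws Exception {
                    Put p = new Put(Bytes.toBytes(cacheKey));
                    // Hypothetical family/qualifier for illustration only.
                    p.add(Bytes.toBytes("meta"), Bytes.toBytes("invalidated_at"),
                          Bytes.toBytes(System.currentTimeMillis()));
                    table.put(p);
                    return null;
                }
            });
            try {
                f.get(500, TimeUnit.MILLISECONDS); // hard per-call deadline
                return true;
            } catch (TimeoutException e) {
                // Interruption may not unstick a truly hung RPC; a production
                // version would also recycle the worker and the connection.
                f.cancel(true);
                return false;
            } catch (Exception e) {
                return false;
            }
        }
    }

The point is the shape of the failure mode: the request thread waits at most 500 ms and then falls back to the in-memory cache or to regenerating the page, instead of joining an unbounded queue behind a dead socket.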