Michal,

Doing tests like that on such a small cluster is basically just asking for trouble ;)
So you might be hitting http://issues.apache.org/jira/browse/HBASE-2244. Also, for clusters under 10 nodes you absolutely need Hadoop 0.20.2, which has https://issues.apache.org/jira/browse/HDFS-872. The short story is that the Namenode will keep sending wrong information to the DFSClient the region server is using... it also has http://issues.apache.org/jira/browse/HDFS-101.

Finally, don't forget that Hadoop 0.20 doesn't support fs sync, so if you kill -9 the region server holding ROOT or META you might lose some rows if they are very recent.

WRT your actual problem: unless we have all the region server logs (you don't have that many), the master log, and a timeline of when your coworker started shutting things down, it's going to be hard to debug.

thx,

J-D

2010/3/5 Michał Podsiadłowski <podsiadlow...@gmail.com>:
> Hi hbase-users!
>
> Yesterday we ran a quite important test. On our production environment we
> introduced HBase as a web-cache layer (the first step in integrating it into
> our environment), and in a controlled manner we tried to break it ;). We
> started stopping various elements, beginning with a datanode, a region
> server, etc. Everything was working very nicely until my coworker started to
> simulate a disaster: he shut down 2/3 of our cluster, 2 of the 3
> datanode/region server nodes. It was still fine, though query times were
> significantly higher, which wasn't a surprise.
> Then one of the region servers was started by a watchdog, and just after it
> stood up my coworker invoked stop. Regions had already started to be
> migrated to this node, and one of them was assigned by the HMaster and
> opened on the region server (there is a message in the log), but the
> confirmation never arrived at the HMaster. The region location was not saved
> to .META., and this state persisted until the HMaster and all the region
> servers were restarted. We couldn't scan or get any row from that region,
> nor disable the table. It looked to us like the master gave up trying to
> assign the region, or assumed that the region was successfully assigned and
> opened.
> I know the scenario we simulated was not a "normal" use case, but we still
> think the cluster should recover after some time, even from such a disaster.
> Just to clarify, all data from this table was replicated, so no blocks were
> missing.
> Our HBase is 0.20.3 from Cloudera and Hadoop is 0.20.1, also a clean
> Cloudera release. (Are any patches advised?)
>
> Our cluster consists of 4 physical machines divided up with Xen:
> 3 machines split into datanode + region server (4 GB RAM) / ZooKeeper
> (512 MB RAM) / our other app
> + the 4th machine split into namenode (2 GB) / secondary namenode (2 GB) /
> HMaster (1 GB)
>
> The region causing the problem is _old-home,,1267642988312.
>
> You can find some logs here:
> fwdn2 - the region server that was stopped while regions were being
> assigned - http://pastebin.com/uL48KCjd
>
> For unknown reasons the master log is corrupted and after some point it
> appears as @@@@@@@.. in vim; yesterday it was fine though.
> What I saw there was something like this:
>
> 10/03/04 11:24:18 INFO master.RegionManager: Assigning region
> _old-home,,1267642988312 to fwdn1,60020,1267698243695
>
> and then nothing more about this region.
>
> Any help appreciated.
>
> Thanks
> Michal
>
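One thing that might help narrow this down: checking whether .META. ever got a server entry for the stuck region. Assuming the cluster is still in that state, something like this from the hbase shell (0.20 syntax; the region name is the one from the master log above) should show it:

  scan '.META.', {STARTROW => '_old-home,,1267642988312', LIMIT => 1,
    COLUMNS => ['info:regioninfo', 'info:server', 'info:serverstartcode']}

If the info:server cell is missing, or still points at the region server that was stopped, that would fit the master losing track of the open.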
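For building the timeline, grepping every log for the region name and sorting the merged output is usually enough, since each log line starts with a timestamp. The paths below are only placeholders; point them at wherever your Hadoop/HBase logs actually live:

  grep -h '_old-home,,1267642988312' /path/to/hbase/logs/hbase-*-master-*.log \
      /path/to/hbase/logs/hbase-*-regionserver-*.log | sort

That gives a rough chronological view of every assignment and open message for the region across the master and the region servers.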