Hey,

ZooKeeper is a pretty fundamental part of how HBase works: it is how we synchronize between the master and the regionservers. The problem is that when a regionserver loses its ZK session, neither side knows what the other knows, and the safest thing is to abort the regionserver. Without that, we can end up with the same region assigned to more than one server, which is pretty messy.
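
To make the double-assignment hazard concrete, here is a toy model in Python. It is only a sketch: `ToyCluster`, `simulate`, and the server and region names are made up for illustration and have nothing to do with the real HBase code. It assumes the master reassigns the regions of a server whose session expired, and shows what happens if that server keeps running instead of aborting.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model (hypothetical): region name -> set of servers serving it."""
    serving: dict = field(default_factory=dict)

    def assign(self, region, server):
        self.serving.setdefault(region, set()).add(server)

    def on_session_expired(self, expired, survivor, expired_server_aborts):
        """Master side: hand every region of the expired server to the survivor."""
        for servers in self.serving.values():
            if expired in servers:
                if expired_server_aborts:
                    # The expired server shut itself down, so it stops serving.
                    servers.discard(expired)
                servers.add(survivor)

    def doubly_assigned(self):
        """Regions currently served by more than one server."""
        return sorted(r for r, s in self.serving.items() if len(s) > 1)


def simulate(expired_server_aborts):
    cluster = ToyCluster()
    cluster.assign("region-a", "rs1")
    cluster.assign("region-b", "rs2")
    # rs1 loses its session; the master reassigns its regions to rs2.
    cluster.on_session_expired("rs1", "rs2", expired_server_aborts)
    return cluster.doubly_assigned()
```

In this model, `simulate(True)` leaves no region with two owners, while `simulate(False)` leaves `region-a` served by both `rs1` and `rs2`, which is the situation the abort is there to prevent.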

ZK is like DNS or the network: if it is not running, we are more or less in trouble. There is no effective difference between a crashed machine and one that is having network problems, so they are treated the same and recovery is the same. That said, the session timeout is set in HBase, and I think it ships at 40 seconds or so, so it should take more than a minor problem or a few lost packets to induce a crash.

If you are killing the entire ZK cluster and expecting HBase to be OK, that is not really what will happen. This is why ZK is run with 2N+1 servers, so you can do rolling reboots and survive the loss of N machines. But ZK is required to be up 24/7; luckily it is fairly reliable.

With HDFS 0.21, at least we'll be able to have effective hlog recovery.

Now, your specific problem looks like a common issue with the master and regionservers being confused about what type of server they are running, which is what the HLogKey vs. THLogKey mismatch in your log suggests. I don't personally run the indexed or transactional extensions (they are not as inherently scalable), so maybe someone else can chime in.

-ryan

On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos <nazario.lu...@gmail.com> wrote:
> Hi,
>
> Today one regionserver crashed and I can't figure out why. Everything
> started with the message "server,60020,1255644477834 znode expired". I'm
> still running the cluster on little memory and swap is getting in my way
> from time to time (it's rare but I need to fix it). Can it be the cause of
> the error bellow? Do you think that five minutes is enough for the property
> zookeeper.session.timeout? Why the message "wrong key class:
> org.apache.hadoop.hbase.regionserver.HLogKey is not class"?
>
> My tests show that whenever zookeeper "shakes" the whole cluster goes down.
> Shouldn't HBase be more robust regarding Zookeeper? Something like a retry
> strategy...
>
> Lucas
>
> 2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average load 7.0
> 2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>}
> 2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
> 2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>}
> 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 12 row(s) of meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
> 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager: server,60020,1255644477834 znode expired
> 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server server,60020,1255644477834: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 hlog(s) in hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> 2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353 -- continuing. Possible DATA LOSS!
> java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>     at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
>     at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
>     at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
>     at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> 2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463 -- continuing. Possible DATA LOSS!
> java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>     at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
>     at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
>     at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
>     at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> 2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556
>
> // More wrong key class errors...
>
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 594 millis for hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
> 2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 6.0[server,60020,1255644477834]
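
For reference, the 2N+1 sizing mentioned at the top of this reply comes down to simple majority arithmetic: a ZK ensemble stays available as long as a strict majority of its servers is up. The Python below is only a back-of-the-envelope sketch; the function names are illustrative and are not part of ZooKeeper or HBase.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)
```

With 3 servers (N=1) one can be down, and with 5 servers (N=2) two can be down. A 4-server ensemble still tolerates only one failure, which is why odd sizes are the usual choice.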