For our system it is critical that there be no data loss and that recovery be
fast if any node goes down.
We've recently updated the hbase-transactional-tableindexed extension to work
with the latest 0.89.20100726 version of HBase (still to be pushed).
All HBase tests pass, but when we started writing our own tests for true sudden
HRegionServer death we ran into trouble.
It seems that the HMaster does not recognize the kill even after many minutes.
Client requests are blocked and the log keeps repeating the entries below.
We realized that HBase's own tests that require RegionServer death use abort()
rather than kill(). abort() does enough cleanup that it does not adequately
simulate a sudden (e.g. JVM crash) death.
As an experiment I made HRegionServer.kill() public and modified
HBaseMiniCluster to call that from abort() instead. Now a test like
TestMasterTransitions will exhibit similar behaviour: the HMaster never
notices that the RegionServer is gone.
Could it really be that sudden region server death is not handled in HBase?
Or, more likely, is this a failure of the testing framework to adequately
simulate kill -9?
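To make the distinction concrete, here is a toy model of the two shutdown paths. This is not HBase code: ToyMaster, reportExit and expireStale are hypothetical names, the 3000 ms timeout is arbitrary, and it assumes only that liveness is tracked by some timeout-based mechanism. It is a sketch of why an abort()-style exit is noticed immediately while a kill()-style exit is noticed only if something expires the silent server.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class LivenessSketch {

    /** Toy master (hypothetical, not an HBase class) tracking servers by last heartbeat. */
    static class ToyMaster {
        private final long timeoutMs;
        private final Map<String, Long> lastHeartbeat = new HashMap<>();
        private final Set<String> dead = new TreeSet<>();

        ToyMaster(long timeoutMs) {
            this.timeoutMs = timeoutMs;
        }

        /** A live server checks in at logical time 'now' (milliseconds). */
        void heartbeat(String server, long now) {
            lastHeartbeat.put(server, now);
        }

        /** Graceful path: the server itself tells the master it is going away. */
        void reportExit(String server) {
            if (lastHeartbeat.remove(server) != null) {
                dead.add(server);
            }
        }

        /** Periodic check: mark servers dead once their last heartbeat is older than the timeout. */
        void expireStale(long now) {
            Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (now - entry.getValue() > timeoutMs) {
                    dead.add(entry.getKey());
                    it.remove();
                }
            }
        }

        boolean isDead(String server) {
            return dead.contains(server);
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster(3000);
        master.heartbeat("rs1", 0);
        master.heartbeat("rs2", 0);

        // abort()-style exit: rs1 cleans up and reports its own exit.
        master.reportExit("rs1");
        System.out.println("rs1 dead after reported exit: " + master.isDead("rs1"));

        // kill()-style exit: rs2 simply goes silent after t=0.
        master.expireStale(1000);
        System.out.println("rs2 dead at t=1000: " + master.isDead("rs2"));
        master.expireStale(4000);
        System.out.println("rs2 dead at t=4000: " + master.isDead("rs2"));
    }
}
```

In this model, if expireStale is never called, or never fires for the silent server, rs2 stays "alive" forever, which resembles the symptom we see: 0 dead reported while connections to the server are refused.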
James Kennedy
Project Manager
Troove Inc.
-------------------------------
[13/08/10 15:12:12] 259494 [n.serverMonitor] INFO oop.hbase.master.ServerManager - 2 region servers, 0 dead, average load 3.5
[13/08/10 15:12:12] 259560 [ger.metaScanner] INFO adoop.hbase.master.BaseScanner - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
[13/08/10 15:12:12] 259561 [ger.metaScanner] WARN adoop.hbase.master.BaseScanner - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
    at $Proxy10.openScanner(Unknown Source)
    at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
    at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
    at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
    at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:68)