There is data loss when master failovers
----------------------------------------
Key: HBASE-4511
URL: https://issues.apache.org/jira/browse/HBASE-4511
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Critical
Fix For: 0.92.0
The sequence is as follows:
The master crashes. At the same time, the RS hosting .META. is crashing, but it
has not exited yet.
The master starts up again and finds all living RSs.
The master fails to verify the .META. location, because this RS is crashing.
The master reassigns .META., but it does not split the HLog of that RS.
So some .META. data is lost.
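
To make the sequence above concrete, here is a toy model, not HBase code. All
names in it (RegionServer, fail_over_meta, split_log_first) are hypothetical
and exist only for this illustration. It assumes that catalog rows still
sitting in the old host's log are recovered only if that log is replayed before
the region is reopened elsewhere. The flag shows the step that is missing in
this scenario; it is not a proposal for the actual fix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegionServer:
    name: str
    flushed: List[str] = field(default_factory=list)  # catalog rows already persisted
    hlog: List[str] = field(default_factory=list)     # rows that exist only in the log
    stopping: bool = False                            # still registered, but rejecting requests


def fail_over_meta(old_host: RegionServer, new_host: RegionServer,
                   split_log_first: bool) -> List[str]:
    """Return the catalog rows visible after the restarted master handles .META."""
    if not old_host.stopping:
        # Verification succeeds; the region stays where it is.
        return old_host.flushed + old_host.hlog
    # Verification fails, so the region is reassigned to new_host.
    rows = list(old_host.flushed)
    if split_log_first:
        # Replay the old host's log so unflushed rows survive the move.
        rows += old_host.hlog
    new_host.flushed = rows
    return rows


def demo_host() -> RegionServer:
    # Hypothetical state: one row persisted, one row only in the log.
    return RegionServer("old-rs", flushed=["region-a"], hlog=["region-b"], stopping=True)


if __name__ == "__main__":
    print(fail_over_meta(demo_host(), RegionServer("new-rs"), split_log_first=False))
    print(fail_over_meta(demo_host(), RegionServer("new-rs"), split_log_first=True))
```

In this model the first call loses "region-b" because the log is never
replayed, which mirrors the reassignment without a log split described above.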
Here are the logs from a failing failover test case.
// The test asks to kill an RS.
2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443):
STOPPED: Killing for unit test
2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007):
RS 192.168.2.102,54385,1317264874629 killed
// The RS has not actually gone down; the new master still finds it up in zk.
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720]
master.HMaster(458): Registering server found up in zk:
192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720]
master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of
znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of
data from znode /hbase/root-region-server and set watcher;
192.168.2.102,54383,131726487...
// Verification of .META. failed and the master reassigned .META., so all the
// regions recorded in .META. are lost.
2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(476): Failed verification of .META.,,1 at
address=192.168.2.102,54385,1317264874629;
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(316): new .META. server:
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of
data from znode /hbase/root-region-server and set watcher;
192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(476): Failed verification of .META.,,1 at
address=192.168.2.102,54385,1317264874629;
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(316): new .META. server:
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of
data from znode /hbase/root-region-server and set watcher;
192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(476): Failed verification of .META.,,1 at
address=192.168.2.102,54385,1317264874629;
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
catalog.CatalogTracker(316): new .META. server:
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating)
unassigned node for 1028785192 with OFFLINE state
2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread]
zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received
ZooKeeper Event, type=NodeCreated, state=SyncConnected,
path=/hbase/unassigned/1028785192
// The master then treats this as a clean cluster startup.
2011-09-28 19:54:52,889 INFO [Master:0;192.168.2.102,54557,1317264885720]
master.AssignmentManager(383): Clean cluster startup. Assigning userregions
2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720]
zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing
unassigned nodes