[ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-4511:
-------------------------

    Attachment: sketch.txt

I think there is a hole here, as identified by Gaojinchao.  It's probably of rare 
incidence but it's easy to reason about.

Say we fail verifying root region location when a new master joins an existing 
cluster; we will reassign it.  We may have failed, though, because the server has 
not yet expired in zk.  In this latter case, the root will likely be assigned to 
a new server.  When the server that had been carrying the root eventually goes 
down, the shutdown handling code may treat it as a server that was carrying 
root (if root location has not yet been updated) or it may not (if root 
location has been updated).  In either case, the old server's edits will likely 
be skipped when root opens in its new location.

Ditto for meta region.

Here is a sketch of a patch that does the following:

+ If on master start, verification of the root location fails AND the server we 
were verifying is a member of the online set, expire the server -- do not 
reassign root (let shutdown handler do the reassigning).
+ Ditto for meta (only, don't re-expire a server if we just expired it on the 
root check).

Drastic, but I think this is 'correct'.
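
To make the two rules above concrete, here is a rough, standalone sketch of the 
decision only.  It is not the attached sketch.txt and it uses no HBase classes; 
the class, enum and method names are all made up for illustration.  Falling 
back to a plain reassign when the server is not in the online set is my 
assumption about the pre-existing behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from HBase.
final class CatalogVerifySketch {
    /** What the starting master does about one catalog region (root or meta). */
    enum Action { NOTHING, EXPIRE_SERVER, REASSIGN }

    // Servers expired during this master startup, remembered so the meta check
    // does not re-expire a server the root check has just expired.
    private final Set<String> expired = new HashSet<>();

    /**
     * @param verified      whether verification of the region location succeeded
     * @param server        the server we were verifying (may be null if unknown)
     * @param onlineServers the master's current online set
     */
    Action check(boolean verified, String server, Set<String> onlineServers) {
        if (verified) {
            return Action.NOTHING; // location is good, leave it alone
        }
        if (server != null && onlineServers.contains(server)) {
            // Server looks alive to the master: expire it and let the shutdown
            // handler do the reassigning.  Set.add returns false if the server
            // was already expired by an earlier check, so we expire only once.
            return expired.add(server) ? Action.EXPIRE_SERVER : Action.NOTHING;
        }
        // Assumption: server is not in the online set, so reassign as before.
        return Action.REASSIGN;
    }
}
```

Calling check() for root and then for meta with the same failing, still-online 
server should expire it once and then do nothing on the second call.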

Patch is not working right though... need to check it out.... later.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 
> org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed; at the same time the RS with meta is crashing, but the RS 
> doesn't exit.
> Master starts up again and finds all living RSs. 
> Master's verification of the meta fails because this RS is crashing.
> Master reassigns the meta, but it doesn't split the HLog. 
> So some meta data is lost.
> Here are the logs of a failing failover test case. 
> //The log says that we want to kill an RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): 
> STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //RS didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> master.HMaster(458): Registering server found up in zk: 
> 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> master.ServerManager(232): Registering 
> server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of 
> znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
> of data from znode /hbase/root-region-server and set watcher; 
> 192.168.2.102,54383,131726487...
> //Meta verification failed and the master reassigned the meta. So all the 
> regions in the meta are lost.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
> address=192.168.2.102,54385,1317264874629; 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
> 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(316): new .META. server: 
> 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
> of data from znode /hbase/root-region-server and set watcher; 
> 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
> address=192.168.2.102,54385,1317264874629; 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
> 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(316): new .META. server: 
> 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
> of data from znode /hbase/root-region-server and set watcher; 
> 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
> address=192.168.2.102,54385,1317264874629; 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
> 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> catalog.CatalogTracker(316): new .META. server: 
> 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or 
> updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] 
> zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received 
> ZooKeeper Event, type=NodeCreated, state=SyncConnected, 
> path=/hbase/unassigned/1028785192
> //The log says that the master did a clean cluster startup.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
> master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
> zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing 
> unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
