[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605123#comment-16605123
 ] 

stack commented on HBASE-21035:
-------------------------------

Late to the party....

I've been playing with removing the procedure WALs and doing other damage to 
the cluster to see how well we recover. I ran into the issue [~allan163] 
describes, where there is no assign for the meta region on startup. In my case, 
I'd removed the procedure WAL dirs as Allan does in his test but, unlike Allan, 
there was no WAL dir for the server that had been carrying meta. I think it had 
been removed across restarts (I didn't check), but I was able to repro by 
manually removing the empty WAL dir (empty because there'd been a clean 
shutdown).

Having read the healthy back and forth above: while the system seems pretty 
robust as is (I had trouble breaking it by removing stuff) and Allan's patch 
would catch a particular case not covered now, I agree that we need an "assign 
meta" fix-it in our hbck2 vocabulary. Let me add scheduling of a meta assign 
(and log search and recovery) to the list of hbck2 fix-its we discussed in 
person a few weeks back.

In my investigations, it seems we need something similar for the 
hbase:namespace table. It can get banjaxed the same way, and if it is not 
online the cluster is a mess.

Master initialization gets stuck trying to read from meta to populate the 
TableStates. It later gets stuck trying to initialize the TableNamespaceManager 
if the namespace table is not online.

I filed HBASE-21156.

> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare them to ZooKeeper RS nodes to decide which 
> servers need to expire. For servers whose dir ends with 'SPLITTING', we 
> ensure that there will be an SCP for each.
> But if the server with the meta region crashed before the master restarts, 
> and if all the procedure WALs are lost (due to a bug, or deleted manually, 
> whatever), the newly restarted master will be stuck when initing, since no 
> one will bring the meta region online.
> Although it is an anomalous case, I think that no matter what happens, we 
> need to online the meta region. Otherwise we are sitting ducks; nothing can 
> be done.
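
For illustration only, here is a rough sketch of the startup check as the 
description above words it: walk the WAL dirs, compare against the live 
ZooKeeper RS nodes, and pick the servers to expire. This is not HBase code. 
The class and method names are hypothetical, server names are simplified to 
plain strings, and the "-splitting" suffix is an assumption standing in for 
the 'SPLITTING' ending mentioned above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are HBase APIs.
final class StartupServerCheck {
    // Assumed suffix for a WAL dir whose logs are being split.
    static final String SPLITTING_SUFFIX = "-splitting";

    // Returns the servers a restarted master would expire, given only
    // the WAL dir names and the set of servers registered in ZooKeeper.
    static Set<String> serversToExpire(Collection<String> walDirNames,
                                       Set<String> liveZkServers) {
        Set<String> expire = new TreeSet<>();
        for (String dir : walDirNames) {
            boolean splitting = dir.endsWith(SPLITTING_SUFFIX);
            String server = splitting
                ? dir.substring(0, dir.length() - SPLITTING_SUFFIX.length())
                : dir;
            // A splitting dir always needs a crash procedure; a plain dir
            // needs one only if its server has no live ZooKeeper node.
            if (splitting || !liveZkServers.contains(server)) {
                expire.add(server);
            }
        }
        return expire;
    }

    public static void main(String[] args) {
        Set<String> live = new HashSet<>(Arrays.asList("rs1"));
        // "rs4" carried meta but has no WAL dir, as in the repro above.
        List<String> walDirs = Arrays.asList("rs1", "rs2", "rs3-splitting");
        Set<String> expire = serversToExpire(walDirs, live);
        System.out.println("expire: " + expire);
        System.out.println("rs4 handled: " + expire.contains("rs4"));
    }
}
```

In this model a server with no WAL dir never shows up in the result, so if 
it was the one carrying meta and the procedure WALs are gone too, nothing 
in the check would schedule a meta assign.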



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
