[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610503#comment-16610503
 ] 

stack commented on HBASE-21035:
-------------------------------

bq. stack, sir, I don't get it, how can SCP finish without meta online? No 
region to assign?

It's hard to trace why on my big cluster, but we somehow lose accounting of
meta after a bunch of Master crashes and restarts. I've also been playing with
the startup sequence, which probably messed things up (Masters do not progress
beyond waitForMasterActive; they don't get to the run method, it seems).
Anyway, I can get into a state where all SCPs are done but meta is not online
(last night meta was in the OPENING state after all SCPs were done).

If meta is not online and we can't scan it to loadMeta, then the Master
shuts down after a minute. I'm working on having the Master hold before the
first meta scan if it can't find an online meta, so that the operator can at
least insert an assign (I think we might just auto-assign if all SCPs are done
and there is still no meta online). We need the 'hold' at least. Replaying all
the WALs can take a while. It would be frustrating for an operator to watch
hundreds of backed-up WALs replay, only for the Master to exit when done.



> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will
> only check WAL dirs and compare them to ZooKeeper RS nodes to decide which
> servers need to expire. For servers whose dir ends with 'SPLITTING', we
> assume that there will be an SCP for it.
> But if the server with the meta region crashed before the master restarts,
> and all the procedure WALs are lost (due to a bug, or deleted manually,
> whatever), the newly restarted master will be stuck when initing, since no
> one will bring the meta region online.
> Although it is an anomalous case, I think that no matter what happens, we
> need to online the meta region. Otherwise, we are sitting ducks; nothing can
> be done.
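
The startup check in the quoted description can be sketched roughly as
follows. This is an illustration under stated assumptions, not the HBASE-20708
code: the class and method names are hypothetical, WAL dirs are assumed to be
named after their server, and the '-splitting' suffix is an assumed spelling
of the 'SPLITTING' marker.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: not the actual HBase startup code.
final class DeadServerScanSketch {

  // Assumed suffix for a WAL dir whose server is already being split.
  static final String SPLITTING_SUFFIX = "-splitting";

  /** Returns servers that have a WAL dir but no live ZooKeeper RS node. */
  static Set<String> serversToExpire(Set<String> walDirNames, Set<String> liveZkServers) {
    Set<String> toExpire = new TreeSet<>();
    for (String dir : walDirNames) {
      if (dir.endsWith(SPLITTING_SUFFIX)) {
        // Treated as already covered by an SCP, so it is skipped here.
        continue;
      }
      if (!liveZkServers.contains(dir)) {
        toExpire.add(dir);
      }
    }
    return toExpire;
  }

  public static void main(String[] args) {
    Set<String> walDirs = new HashSet<>(Arrays.asList("rs1", "rs2", "rs3-splitting"));
    Set<String> live = new HashSet<>(Arrays.asList("rs1"));
    System.out.println(serversToExpire(walDirs, live));
  }
}
```

In this sketch the '-splitting' dir is skipped on the expectation that an SCP
already exists for it. If the procedure WALs are lost, that SCP is gone too,
and if the skipped server was carrying meta, nothing schedules the meta
assignment, which is the stuck state described in the issue.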



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
