[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608185#comment-16608185
 ] 

stack commented on HBASE-21035:
-------------------------------

I'm back. 

My master is aborting. It spends hours reconstructing the assignment state 
from reading 300+ WALs, then becomes active. In the background there are 
procedures running and finishing... mostly SCPs.

Then my master is dying with:

2018-09-07 22:21:58,968 ERROR org.apache.hadoop.hbase.master.HMaster: ***** 
ABORTING master vc0207.halxg.cloudera.com,22001,1536380265734: Unhandled 
exception. Starting shutdown. *****
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Fri Sep 07 22:21:58 PDT 2018, null, java.net.SocketTimeoutException: 
callTimeout=60000, callDuration=69864: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on vd0412.halxg.cloudera.com,22101,1536380043533

i.e. meta is not online even though the above server's SCP has completed.

This is a dirty install, so what was up in zk for the meta location could be 
long stale... but here we have a state where no SCPs are running and meta is 
not online.

I need to write the tool to insert a meta assign. It just takes 4 or 5 hours 
before I know if it is the fix for this problem. And then there is the scan of 
the hbase:namespace table next.

Thinking of waiting on all SCPs to finish before we do our first meta scan .... 
and if meta is still not online, then, auto-schedule the restore meta procedure 
... splitting meta logs inline and then assigning meta. Would this violate your 
principle [~Apache9]?

In other words, I need to write the restore meta procedure -- it would split 
meta logs and then do the assign of meta -- but I think we should auto-schedule 
it in the case above.
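
A minimal sketch of the flow proposed above, for illustration only. Every name 
here (MetaRecoverySketch, MasterView, FakeMaster, ensureMetaOnline) is 
hypothetical and not an HBase API, and the bounded polling loop is an 
assumption added so the example terminates. It only shows the ordering: wait 
for SCPs to drain, check meta, and only then split meta logs and assign.

```java
import java.util.ArrayList;
import java.util.List;

class MetaRecoverySketch {

    /** Hypothetical view of master state; these are NOT real HBase APIs. */
    interface MasterView {
        boolean scpsRunning();   // any server crash procedures still in flight?
        boolean isMetaOnline();  // is hbase:meta currently assigned and served?
        void splitMetaLogs();    // split the WALs that carry meta edits
        void assignMeta();       // schedule an assign of hbase:meta
    }

    /**
     * Waits for SCPs to finish, then restores meta only if it is still offline.
     * Returns true if the restore path (split, then assign) was taken.
     */
    static boolean ensureMetaOnline(MasterView master, long pollMillis, int maxPolls) {
        int polls = 0;
        while (master.scpsRunning()) {
            if (++polls > maxPolls) {
                throw new IllegalStateException("SCPs still running; cannot decide yet");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting on SCPs", e);
            }
        }
        if (master.isMetaOnline()) {
            return false; // some SCP already brought meta online; nothing to do
        }
        master.splitMetaLogs(); // split first so no meta edits are lost
        master.assignMeta();
        return true;
    }

    /** In-memory stand-in used only to exercise the flow. */
    static final class FakeMaster implements MasterView {
        int pendingScps;
        boolean metaOnline;
        final List<String> steps = new ArrayList<>();

        FakeMaster(int pendingScps, boolean metaOnline) {
            this.pendingScps = pendingScps;
            this.metaOnline = metaOnline;
        }

        // Each check pretends that one more SCP has finished.
        public boolean scpsRunning() {
            if (pendingScps > 0) {
                pendingScps--;
                return true;
            }
            return false;
        }

        public boolean isMetaOnline() { return metaOnline; }

        public void splitMetaLogs() { steps.add("splitMetaLogs"); }

        public void assignMeta() {
            steps.add("assignMeta");
            metaOnline = true;
        }
    }

    public static void main(String[] args) {
        FakeMaster master = new FakeMaster(2, false);
        boolean restored = ensureMetaOnline(master, 1L, 10);
        System.out.println(restored + " " + master.steps);
    }
}
```

The check on meta happens only after the SCPs have drained, so a meta assign 
that an SCP was about to do is not duplicated.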

> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare them to the Zookeeper RS nodes to decide which 
> servers need to expire. For servers whose dir ends with 'SPLITTING', we ensure 
> that there will be an SCP for each.
> But if the server with the meta region crashed before the master restarts, and 
> if all the procedure WALs are lost (due to a bug, or deleted manually, 
> whatever), the newly restarted master will be stuck when initing, since no one 
> will bring the meta region online.
> Although it is an anomalous case, I think that no matter what happens, we need 
> to online the meta region. Otherwise we are sitting ducks; nothing can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
