[ 
https://issues.apache.org/jira/browse/HBASE-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616214#comment-16616214
 ] 

Duo Zhang commented on HBASE-19121:
-----------------------------------

In HBASE-21035, [~allan163], [~stack] and I discussed at length how to get 
the cluster back when the procedure WALs are broken, or put simply, how to 
bring meta online and then process the dead servers and assign regions after 
we remove all the procedure WALs. We finally (maybe) reached an agreement that 
this recovery work should be done by the HBCK2 framework. So I think we need 
to do this:

1. Implement a RecoverMetaProcedure (not the old one), where we find the 
location of meta on zk and check whether the hosting RS is still alive, by 
checking the ephemeral node on zk or by some other means. If it is not, split 
the log inline (do we really need to split? It only contains WALs from 
meta...), and then assign meta to a live RS. 
2. Once meta is online, we are able to execute SCPs. We can scan the WAL 
directories, find the ones that end with the 'splitting' suffix, and schedule 
SCPs for them. We have procedure locks, so scheduling duplicate SCPs for one 
RS should usually not be a big problem (we need to confirm that the procedure 
scheduler handles this fine).
3. After all SCPs have finished, we could still have some regions in an 
intermediate state, because we may also have removed some TRSPs. We should 
be able to find these regions and schedule TRSPs to bring them online.
4. How to deal with a split or merge in the middle? No ideas for now...
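
As a rough illustration of the selection logic in steps 2 and 3, here is a 
minimal Python sketch. It is not HBase code and not the HBCK2 implementation. 
It assumes that a WAL directory under split is named 
"<servername>-splitting", and the function names, the state names and the 
choice of OPEN/CLOSED as the only stable states are all hypothetical, chosen 
for illustration. Collecting server names into a set means each RS is 
reported once, so at most one SCP per RS would be scheduled from this scan.

```python
from pathlib import Path
import tempfile

# Assumption: a WAL directory being split is named "<servername>-splitting".
SPLITTING_SUFFIX = "-splitting"

# Assumption: only these states count as stable (hypothetical names).
STABLE_STATES = frozenset({"OPEN", "CLOSED"})


def servers_needing_scp(wal_root):
    """Step 2: return the sorted, de-duplicated server names whose WAL
    directory name ends with the splitting suffix."""
    servers = set()
    for entry in Path(wal_root).iterdir():
        if entry.is_dir() and entry.name.endswith(SPLITTING_SUFFIX):
            # Strip the suffix to recover the server name.
            servers.add(entry.name[: -len(SPLITTING_SUFFIX)])
    return sorted(servers)


def regions_needing_trsp(region_states):
    """Step 3: return the sorted region names whose state is not stable.

    region_states maps a region name to its current state string.
    """
    return sorted(
        region for region, state in region_states.items()
        if state not in STABLE_STATES
    )


if __name__ == "__main__":
    # Build a throwaway WAL root with one splitting and one normal directory.
    with tempfile.TemporaryDirectory() as root:
        for name in (
            "rs1.example.com,16020,1536000000000-splitting",
            "rs2.example.com,16020,1536000000001",
        ):
            (Path(root) / name).mkdir()
        print(servers_needing_scp(root))
    print(regions_needing_trsp({"r1": "OPEN", "r2": "OPENING"}))
```

A real tool would read the WAL directories from the cluster file system and 
the region states from meta rather than from a local directory and a dict.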

Note that this is very important for bringing hbase2 into production 
environments, as no one can guarantee that a project is free of bugs. In the 
long run, we will always hit some critical bugs or operational accidents 
that leave the cluster trapped in a state from which it cannot recover 
automatically. HBCK2 is our last line of defense.



> HBCK for AMv2 (A.K.A HBCK2)
> ---------------------------
>
>                 Key: HBASE-19121
>                 URL: https://issues.apache.org/jira/browse/HBASE-19121
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck
>            Reporter: stack
>            Assignee: Umesh Agashe
>            Priority: Major
>         Attachments: hbase-19121.master.001.patch
>
>
> We don't have an hbck for the new AM. Old hbck may actually do damage going 
> against AMv2.
> Fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
