[ https://issues.apache.org/jira/browse/HBASE-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616214#comment-16616214 ]
Duo Zhang commented on HBASE-19121:
-----------------------------------

In HBASE-21035, [~allan163], [~stack] and I discussed at length how to get the cluster back when the procedure WALs are broken. Put simply: after we remove all the procedure WALs, how do we bring meta online, and then process the dead servers and assign regions? We finally (maybe) reached an agreement that this recovery work should be done by the HBCK2 framework.

So I think we need to do this:

1. Implement a RecoverMetaProcedure (not the old one), where we find the location of meta on zk and check whether that server is still alive, by checking the ephemeral node on zk or something else. If it is not alive, split the log inline (do we really need to split? It only contains WALs from meta...), and then assign meta to a live RS.
2. Once meta is online, we are able to execute SCPs. We can scan the WAL directories, find the ones that end with the 'splitting' suffix, and schedule SCPs for them. We have procedure locks, so scheduling duplicated SCPs for one RS should usually not be a big problem (need to confirm that the procedure scheduler works fine in this case).
3. After all SCPs have finished, we could still have some regions in an intermediate state, because we may also have removed some TRSPs. We should be able to find these regions and schedule TRSPs to bring them online.
4. How to deal with a split or merge that was in the middle of executing? No ideas for now...

Notice that this is very important for bringing hbase2 into production environments, as no one can guarantee that a project has no bugs. In the long run, we will always hit some critical bugs or operational accidents that trap the cluster in a state from which it cannot recover automatically. HBCK2 is our last line of defense.
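As a rough illustration of the scan in step 2, here is a minimal Java sketch. It is not HBase code and calls no HBase APIs: the class `SplittingWalScan` and the method `serversNeedingScp` are hypothetical names, and it assumes that each WAL directory is named after its region server and that a directory under recovery ends with the literal suffix `-splitting`. The sketch also drops duplicate server names, as a cheap precaution on top of the procedure locks mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of HBase or HBCK2.
class SplittingWalScan {
    // Assumption: directories of servers under recovery end with this suffix.
    static final String SPLITTING_SUFFIX = "-splitting";

    /**
     * Returns the distinct server names whose WAL directory name carries the
     * splitting suffix, in first-seen order. An SCP would be scheduled for each.
     */
    static List<String> serversNeedingScp(List<String> walDirNames) {
        Set<String> servers = new LinkedHashSet<>();
        for (String dir : walDirNames) {
            // Skip directories without the suffix, and a bare suffix with no server name.
            if (dir.endsWith(SPLITTING_SUFFIX) && dir.length() > SPLITTING_SUFFIX.length()) {
                servers.add(dir.substring(0, dir.length() - SPLITTING_SUFFIX.length()));
            }
        }
        return new ArrayList<>(servers);
    }

    public static void main(String[] args) {
        // Made-up directory names in an assumed host,port,startcode layout.
        List<String> dirs = List.of(
            "rs1.example.com,16020,1536000000000",
            "rs2.example.com,16020,1536000000001-splitting",
            "rs3.example.com,16020,1536000000002-splitting");
        for (String server : serversNeedingScp(dirs)) {
            System.out.println("would schedule SCP for " + server);
        }
    }
}
```

In a real tool the directory names would come from listing the WAL root on the filesystem, and each returned name would be handed to whatever mechanism HBCK2 ends up exposing for scheduling an SCP.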
> HBCK for AMv2 (A.K.A HBCK2)
> ---------------------------
>
>                 Key: HBASE-19121
>                 URL: https://issues.apache.org/jira/browse/HBASE-19121
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck
>            Reporter: stack
>            Assignee: Umesh Agashe
>            Priority: Major
>         Attachments: hbase-19121.master.001.patch
>
>
> We don't have an hbck for the new AM. Old hbck may actually do damage going
> against AMv2.
> Fix.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)