[ https://issues.apache.org/jira/browse/HBASE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205766#comment-17205766 ]
Michael Stack commented on HBASE-25142:
---------------------------------------

At a minimum, we might add to 'hbck2 fixMeta' the scheduling of SCPs for all servers in the 'Unknown Servers' list.

> Auto-fix 'Unknown Server'
> -------------------------
>
>                 Key: HBASE-25142
>                 URL: https://issues.apache.org/jira/browse/HBASE-25142
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Michael Stack
>            Priority: Major
>
> Addressing reports of 'Unknown Server' has come up in various conversations
> lately. This issue is about fixing instances of 'Unknown Server'
> automatically as part of the tasks undertaken by CatalogJanitor when it runs.
> First, though, I would like to work out a definition for 'Unknown Server' and
> a list of the ways in which they arise. We need this to figure out how to do
> safe auto-fixing.
> Currently an 'Unknown Server' is a server found in hbase:meta that is not
> online (no recent heartbeat) and that is not mentioned in the dead servers
> list.
> In outline, I'd think CatalogJanitor could schedule an expiration of the RS
> znode in zk (if it exists) and then an SCP if it finds an 'Unknown Server'.
> Perhaps it waits for 2x or 10x the heartbeat interval just in case (or not).
> The SCP would clean up any references in hbase:meta by reassigning Regions
> assigned to the 'Unknown Server' after replaying any WALs found in hdfs
> attributed to the dead server.
> As to how they arise:
> * A contrived illustration would be a large online cluster that crashes with
> a massive backlog of WAL files (say zk went down for some reason). The replay
> of the WALs looks like it could take a very long time (let's say the cluster
> was badly configured and a bug and misconfig made it so each RS was carrying
> hundreds of WALs and there are hundreds of servers). To get the service back
> online, the procedure store and WALs are moved aside (for later replay with
> WALPlayer). The cluster comes up. meta is onlined but refers to server
> instances that are no longer around. We can schedule an SCP per server
> mentioned in the 'HBCK Report' by scraping and scripting hbck2 or, better,
> CatalogJanitor could just do it.
> * HBASE-24286 HMaster won't become healthy after after cloning... describes
> starting a cluster over data that is hfile-content only. In this case the
> original servers used to manufacture the hfile cluster data are long dead, yet
> meta still refers to the old servers. They will not make the 'dead servers'
> list.
> Let this issue stew awhile. In the meantime, collect how 'Unknown Server'
> gets created and the best way to fix it.
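
For illustration only, the working definition in the description above (a server referenced in hbase:meta that is neither online nor in the dead servers list) comes down to a set difference. The sketch below is plain Java over server-name strings. The class UnknownServers, the method find, and the three input sets are hypothetical names made up for this example and are not HBase APIs. How the sets would actually be gathered from meta and the master is left out.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, NOT part of HBase: illustrates the 'Unknown Server'
// definition from the issue description as a set difference.
final class UnknownServers {
    private UnknownServers() {}

    /**
     * Returns the servers that are referenced in hbase:meta but are neither
     * online nor in the dead servers list. All three inputs are assumed to be
     * server-name strings collected elsewhere (an assumption of this sketch).
     */
    static Set<String> find(Set<String> referencedInMeta,
                            Set<String> online,
                            Set<String> dead) {
        // Copy first so the caller's set is left untouched.
        Set<String> unknown = new TreeSet<>(referencedInMeta);
        unknown.removeAll(online); // has a recent heartbeat: not unknown
        unknown.removeAll(dead);   // already tracked as dead: not unknown
        return unknown;
    }
}
```

Each name returned would be a candidate for the znode expiration and SCP described above. Whether to wait some multiple of the heartbeat interval first is the open question from the description and is not modeled here.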