[jira] [Commented] (HBASE-25142) Auto-fix 'Unknown Server'

Michael Stack (Jira) Thu, 01 Oct 2020 11:56:56 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205766#comment-17205766
 ]


Michael Stack commented on HBASE-25142:
---------------------------------------

At a minimum, we might add to 'hbck2 fixMeta' the scheduling of SCPs for all 
servers in the 'Unknown Servers' list.
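
A minimal sketch, in plain Java, of picking the servers that would get an SCP. It uses the definition from the issue description quoted below: a server referenced in hbase:meta that is neither online nor on the dead servers list. The class name, method name, and sample server names are hypothetical. This is not HBase or hbck2 code and it calls no HBase API.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; server names are plain strings here.
final class UnknownServerSketch {

    // Returns the servers referenced by hbase:meta that are neither online
    // nor present in the dead servers list.
    static Set<String> findUnknownServers(Set<String> referencedInMeta,
                                          Set<String> online,
                                          Set<String> dead) {
        Set<String> unknown = new TreeSet<>(referencedInMeta);
        unknown.removeAll(online);
        unknown.removeAll(dead);
        return unknown;
    }

    public static void main(String[] args) {
        // Made-up sample data in host,port,startcode form.
        Set<String> meta = Set.of("rs1,16020,1", "rs2,16020,2", "rs3,16020,3");
        Set<String> online = Set.of("rs1,16020,1");
        Set<String> dead = Set.of("rs2,16020,2");
        // Each name listed here would be a candidate for a scheduled SCP.
        for (String server : findUnknownServers(meta, online, dead)) {
            System.out.println("would schedule SCP for " + server);
        }
    }
}
```

With the sample data, only rs3 is selected: rs1 is online and rs2 is already on the dead servers list.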

> Auto-fix 'Unknown Server'
> -------------------------
>
>                 Key: HBASE-25142
>                 URL: https://issues.apache.org/jira/browse/HBASE-25142
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Michael Stack
>            Priority: Major
>
> Addressing reports of 'Unknown Server' has come up in various conversations 
> lately. This issue is about fixing instances of 'Unknown Server' 
> automatically as part of the tasks undertaken by CatalogJanitor when it runs.
> First, though, I would like to settle on a definition of 'Unknown Server' 
> and a list of the ways in which they arise. We need this to work out how to 
> auto-fix safely.
> Currently an 'Unknown Server' is a server found in hbase:meta that is not 
> online (no recent heartbeat) and that is not mentioned in the dead servers 
> list.
> In outline, I'd think CatalogJanitor, on finding an 'Unknown Server', could 
> schedule an expiration of the RS znode in zk (if it exists) and then an SCP. 
> Perhaps it waits for 2x or 10x the heartbeat interval just in case (or not). 
> The SCP would clean up any references in hbase:meta by reassigning Regions 
> assigned to the 'Unknown Server' after replaying any WALs found in hdfs 
> attributed to the dead server.
> As to how they arise:
>  * A contrived illustration would be a large online cluster that crashes 
> with a massive backlog of WAL files – say zk went down for some reason. The 
> replay of the WALs looks like it could take a very long time (let's say the 
> cluster was badly configured, and a bug plus the misconfig meant each RS was 
> carrying hundreds of WALs, and there are hundreds of servers). To get the 
> service back online, the procedure store and WALs are moved aside (for later 
> replay with WALPlayer). The cluster comes up. meta is onlined but refers to 
> server instances that are no longer around. We can schedule an SCP per server 
> mentioned in the 'HBCK Report' by scraping and scripting hbck2 or, better, 
> CatalogJanitor could just do it.
>  * HBASE-24286 HMaster won't become healthy after cloning... describes 
> starting a cluster over data that is hfile-content only. In this case the 
> original servers used to manufacture the hfile cluster data are long dead, 
> yet meta still refers to the old servers. They will not make the 'dead 
> servers' list.
> Let this issue stew awhile. In the meantime, collect the ways 'Unknown 
> Server' gets created and the best way to fix each.
>  
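
The "wait 2x or 10x the heartbeat interval" idea in the description above could be sketched as a simple grace-period check before any SCP is scheduled. This is only an illustration in plain Java: the class, method, and parameter names are hypothetical, the durations are made up, and nothing here reflects how CatalogJanitor is actually implemented.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not HBase code.
final class GracePeriodSketch {

    // True once a server has been observed as 'Unknown' for longer than
    // multiplier x heartbeatInterval, i.e. the just-in-case wait has elapsed.
    static boolean shouldScheduleScp(Duration observedUnknownFor,
                                     Duration heartbeatInterval,
                                     int multiplier) {
        Duration grace = heartbeatInterval.multipliedBy(multiplier);
        return observedUnknownFor.compareTo(grace) > 0;
    }

    public static void main(String[] args) {
        Duration heartbeat = Duration.ofSeconds(3); // illustrative value only
        // Unknown for 10 s against a 2x grace period (6 s): act.
        System.out.println(shouldScheduleScp(Duration.ofSeconds(10), heartbeat, 2));
        // Unknown for 10 s against a 10x grace period (30 s): keep waiting.
        System.out.println(shouldScheduleScp(Duration.ofSeconds(10), heartbeat, 10));
    }
}
```

A larger multiplier trades slower cleanup for a lower chance of acting on a server that is merely late to report in.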



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
