[ https://issues.apache.org/jira/browse/HBASE-24286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162302#comment-17162302 ]
Tak-Lon (Stephen) Wu commented on HBASE-24286:
----------------------------------------------

[PR#2114|https://github.com/apache/hbase/pull/2114] is for the master branch.

> HMaster won't become healthy after cloning or creating a new cluster pointing at the same file system
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-24286
>                 URL: https://issues.apache.org/jira/browse/HBASE-24286
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 3.0.0-alpha-1, 2.2.3, 2.2.4, 2.2.5
>            Reporter: Jack Ye
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Major
>
> h1. How to reproduce:
> # The user starts an HBase cluster on top of a file system.
> # The user performs some operations and shuts down the cluster; all of the data is still persisted in the file system.
> # The user creates a new HBase cluster, using a different set of servers, on top of the same file system with the same root directory.
> # HMaster cannot initialize.
> h1. Root cause:
> During the HMaster initialization phase, the following happens:
> # HMaster waits for the namespace table to come online.
> # AssignmentManager gets the region info for all namespace table regions.
> # The region servers recorded as hosting the namespace table are already dead, so the online check fails.
> # HMaster keeps waiting for the namespace regions to come online, retrying up to 1000 times, which effectively means forever.
> Code that waits for the namespace table to come online:
> https://github.com/apache/hbase/blob/rel/2.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1102
> h1. Stack trace (running on S3):
> 2020-04-23 08:15:57,185 WARN [master/ip-10-12-13-14:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1587628169070.d34b65b91a52644ed3e77c5fbb065c2b. is NOT online; state={d34b65b91a52644ed3e77c5fbb065c2b state=OPEN, ts=1587629742129, server=ip-10-12-13-14.ec2.internal,16020,1587628031614}; ServerCrashProcedures=false.
> Master startup cannot progress, in holding-pattern until region onlined.
> Here ip-10-12-13-14.ec2.internal is the old region server that hosted the hbase:namespace region.
> h1. Discussion for the fix
> We see there is a fix for this in branch-3: https://issues.apache.org/jira/browse/HBASE-21154. Before we provide a patch, we would like to know from the community whether we should backport that change to branch-2 or just make a fix with minimal code change.
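To make the root cause concrete, here is a rough, self-contained Java model of the holding pattern. It is not HBase code: `RegionState`, `isOnline`, `waitForRegionOnline`, and the live-server names are hypothetical, and the online check is simplified to "the region is OPEN on a server the new master sees as live". The stale location is taken from the log line above; the sketch only assumes that nothing reassigns the region while the master waits.

```java
import java.util.Set;

public class NamespaceWaitSketch {

    // Hypothetical, simplified stand-in for the region state the master reads back.
    record RegionState(String encodedName, String state, String serverName) {}

    // Hypothetical online check: OPEN is only trusted if the recorded server is live.
    static boolean isOnline(RegionState rs, Set<String> liveServers) {
        return "OPEN".equals(rs.state()) && liveServers.contains(rs.serverName());
    }

    // Models the holding pattern: poll up to maxRetries times.
    // Returns the attempt on which the region was seen online, or -1 if it never was.
    static int waitForRegionOnline(RegionState rs, Set<String> liveServers, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (isOnline(rs, liveServers)) {
                return attempt;
            }
            // In this model nothing reassigns the region between attempts,
            // so the recorded (stale) location never changes.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Region recorded as OPEN on a region server from the old cluster.
        RegionState ns = new RegionState("d34b65b91a52644ed3e77c5fbb065c2b", "OPEN",
                "ip-10-12-13-14.ec2.internal,16020,1587628031614");
        // Region servers of the new cluster (hypothetical names).
        Set<String> liveServers = Set.of("new-rs-1,16020,1", "new-rs-2,16020,1");
        int result = waitForRegionOnline(ns, liveServers, 1000);
        System.out.println(result == -1 ? "never online" : "online after " + result);
    }
}
```

Because the recorded server is never in the new cluster's live set, every one of the 1000 attempts fails the same way; retrying cannot help unless something moves the region to a live server.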