[jira] [Commented] (HBASE-24286) HMaster won't become healthy after cloning or creating a new cluster pointing at the same file system

Anoop Sam John (Jira) Sat, 02 May 2020 10:16:24 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-24286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098028#comment-17098028
 ]


Anoop Sam John commented on HBASE-24286:
----------------------------------------

On the new cluster, regions come online through crash recovery of the old 
cluster's RSs. In 2.x this depends on either the WAL dir of each old RS being 
available or an SCP entry for it already having been added to the old 
cluster's MasterProcWAL file. That said, we need the old cluster's WAL dir as 
well. If you can back up the old cluster's WAL directory, restore it into the 
new cluster, and then start HBase, all will be OK. I believe the WAL dir is 
backed by HDFS.
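
To make the condition above concrete, here is a minimal sketch of the 
either/or check. It is only an illustration, not HBase code: the class name, 
the method name, and the two sets (server names that still have a WAL dir, 
and server names with an SCP recorded in the old MasterProcWAL) are 
assumptions made up for this example.

```java
import java.util.Set;

// Hypothetical model of the recovery precondition described in the comment.
// None of these names exist in HBase; they only illustrate the either/or rule.
class RecoverySketch {

    /**
     * An old region server can be crash-recovered by the new master if its
     * WAL dir is still available OR an SCP for it was already recorded.
     */
    static boolean canRecover(String oldServerName,
                              Set<String> serversWithWalDir,
                              Set<String> serversWithScp) {
        return serversWithWalDir.contains(oldServerName)
            || serversWithScp.contains(oldServerName);
    }

    public static void main(String[] args) {
        // Server name taken from the log line in the issue description.
        String oldRs = "ip-10-12-13-14.ec2.internal,16020,1587628031614";

        // New cluster without the old WAL dir and without an SCP entry.
        System.out.println(canRecover(oldRs, Set.of(), Set.of()));

        // Old WAL dir backed up and restored before starting HBase.
        System.out.println(canRecover(oldRs, Set.of(oldRs), Set.of()));
    }
}
```

In this model, restoring the old WAL directory is what flips the first case 
into the second.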

> HMaster won't become healthy after cloning or creating a new cluster 
> pointing at the same file system
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-24286
>                 URL: https://issues.apache.org/jira/browse/HBASE-24286
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.2.3
>            Reporter: Jack Ye
>            Priority: Major
>
> h1. How to reproduce:
>  # user starts an HBase cluster on top of a file system
>  # user performs some operations and shuts down the cluster, all the data are 
> still persisted in the file system
>  # user creates a new HBase cluster using a different set of servers on top 
> of the same file system with the same root directory
>  # HMaster cannot initialize
> h1. Root cause:
> During HMaster initialization phase, the following happens:
>  # HMaster waits for namespace table online
>  # AssignmentManager gets all namespace table regions info
>  # region servers of namespace table are already dead, online check fails
>  # HMaster waits for the namespace regions to come online, retrying up to 
> 1000 times, which in practice means forever
> Code waiting for namespace table to be online: 
> https://github.com/apache/hbase/blob/rel/2.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1102
> h1. Stack trace (running on S3):
> 2020-04-23 08:15:57,185 WARN [master/ip-10-12-13-14:16000:becomeActiveMaster] 
> master.HMaster: 
> hbase:namespace,,1587628169070.d34b65b91a52644ed3e77c5fbb065c2b. is NOT 
> online; state={d34b65b91a52644ed3e77c5fbb065c2b state=OPEN, 
> ts=1587629742129, server=ip-10-12-13-14.ec2.internal,16020,1587628031614}; 
> ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> where ip-10-12-13-14.ec2.internal is the old region server hosting the region 
> of hbase:namespace.
> h1. Discussion for the fix
> We see there is a fix for this at branch-3: 
> https://issues.apache.org/jira/browse/HBASE-21154. Before we provide a patch, 
> we would like to know from the community if we should backport this change to 
> branch-2, or if we should just perform a fix with minimum code change.
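
The holding pattern from the quoted root cause can be sketched as a bounded 
retry loop. This is a simplified illustration and not the code in 
HMaster.java: the class, the method, and its parameters are hypothetical, and 
the limit of 1000 attempts is taken from the issue description rather than 
checked against the source.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "wait until the region is online, retry N times".
class HoldingPatternSketch {

    /** Returns the attempt on which the region was online, or -1 if it never was. */
    static int waitForOnline(BooleanSupplier isOnline, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isOnline.getAsBoolean()) {
                return attempt;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Region recorded as OPEN on a dead server with no ServerCrashProcedure:
        // the online check can never succeed, so every attempt is wasted.
        System.out.println(waitForOnline(() -> false, 1000, 0));

        // If something reassigns the region, the wait ends once the check passes.
        int[] calls = {0};
        System.out.println(waitForOnline(() -> ++calls[0] >= 3, 1000, 0));
    }
}
```

With a real sleep between attempts, the first case is what shows up as a 
master stuck in its holding pattern: nothing in the loop itself can change 
the outcome of the online check.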



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
