[ https://issues.apache.org/jira/browse/HBASE-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419556#comment-17419556 ]
Anoop Sam John commented on HBASE-26287:
----------------------------------------

Why is the SSH for the old RS (where the NS region was) not kicking in? Did you lose the MasterProcWAL?
I believe there is an hbck2 option to assign the NS region in such a stuck state. We should use such tools in this case, IMO.

> the initialization of master could not be completed when hbase:namespace
> region is not online
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26287
>                 URL: https://issues.apache.org/jira/browse/HBASE-26287
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 2.3.5
>            Reporter: bolao
>            Priority: Major
>
> After an HBase cluster shuts down unexpectedly and is restarted, we sometimes
> find that the master cannot initialize because it is stuck in the
> isRegionOnline method for hbase:namespace. From the master logs, we found that
> the master and the meta table consider the hbase:namespace region online, but
> its regionserver is dead. isRegionOnline logs this once every minute and does
> nothing else. I think we could remove the record from the AssignmentManager's
> RegionState and assign hbase:namespace to another regionserver, so that the
> HBase cluster recovers without human intervention. I would like to ask for
> your advice: what do you think?
> {panel:title=the logs of master}
> 2021-09-02 18:32:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO
> org.apache.hadoop.hbase.ChoreService.scheduleChore(157) -Chore ScheduledChore
> name=fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore,
> period=60000, unit=MILLISECONDS is enabled.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO
> org.apache.hadoop.hbase.ScheduledChore.run(172) -Chore:
> fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore
> missed its start time
> 2021-09-02 18:33:41 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:33:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:34:31 [qtp780802740-4192] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:34:31 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54124
> 2021-09-02 18:34:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:34:51 [qtp780802740-4202] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:34:51 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122
> 2021-09-02 18:35:41 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:35:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:36:20 [qtp780802740-4192] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:36:20 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122
> 2021-09-02 18:36:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:36:57 [RSProcedureDispatcher-pool4-t23] WARN
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.scheduleForRetry(323)
> -request to
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1630405458162
> failed due to org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht/172.49.9.38:16020
> failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
> Call[id=6192,methodName=ExecuteProcedures], waitTime=600008,
> rpcTimeout=600000, try=7, retrying...
> 2021-09-02 18:37:42 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:37:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:38:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:38:49 [zk-event-processor-pool1-t1] INFO
> org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeDeleted(94) -Node
> deleted id=168
> 2021-09-02 18:39:42 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:39:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
>
> {panel:title=the code of master}
> https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1214
> {panel}
>
> {panel}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
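
The condition described in the report (a region recorded as OPEN on a server that is no longer live, with no ServerCrashProcedure scheduled) can be spotted mechanically from the isRegionOnline WARN message quoted above. The following is only a minimal Python sketch for log triage, not HBase code and not how the master decides anything. The names `parse_region_state` and `is_stuck_on_dead_server`, and the idea of passing in a set of live server names, are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional, Set

# Matches the "state={...}; ServerCrashProcedures=..." part of the WARN message.
# The optional backslash before "{" is the Jira escape seen in the quoted log.
_STATE_RE = re.compile(
    r"state=\\?\{(?P<region>[0-9a-f]{32}) state=(?P<state>\w+), "
    r"ts=(?P<ts>\d+), server=(?P<server>[^}]+)\}; "
    r"ServerCrashProcedures=(?P<scp>true|false)"
)


class RegionReport(NamedTuple):
    region: str          # encoded region name
    state: str           # e.g. OPEN
    ts: int              # timestamp of the last state change (ms)
    server: str          # host,port,startcode of the recorded location
    scp_scheduled: bool  # value of ServerCrashProcedures=...


def parse_region_state(message: str) -> Optional[RegionReport]:
    """Extract the region state from one unwrapped isRegionOnline message."""
    m = _STATE_RE.search(message)
    if m is None:
        return None
    return RegionReport(
        m["region"], m["state"], int(m["ts"]), m["server"], m["scp"] == "true"
    )


def is_stuck_on_dead_server(report: RegionReport, live_servers: Set[str]) -> bool:
    """True if the region is OPEN on a server that is not live and no SCP is pending."""
    return (
        report.state == "OPEN"
        and report.server not in live_servers
        and not report.scp_scheduled
    )


# The WARN message from the report, with the mail line wrapping removed.
SAMPLE = (
    "hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT online; "
    r"state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, ts=1630577738198, "
    "server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870}; "
    "ServerCrashProcedures=false. Master startup cannot progress, in "
    "holding-pattern until region onlined."
)

if __name__ == "__main__":
    report = parse_region_state(SAMPLE)
    # Hypothetical live-server set; in practice it would come from the cluster.
    live = {"some-live-rs.example,16020,1630405458162"}
    if report is not None and is_stuck_on_dead_server(report, live):
        print(f"region {report.region} is OPEN on non-live server {report.server}")
```

A region flagged this way is a candidate for the manual recovery suggested in the comment (for example with the HBCK2 tool); the sketch itself only reads logs and changes nothing in the cluster.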