[ https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762316#comment-16762316 ]
Bahram Chehrazy commented on HBASE-21844:
-----------------------------------------

I rebooted the master and it got stuck in the same state again. Maybe the server "<server1>,16020,1549450371876" no longer owns meta. But the stale state in the ZK is lying to the master.

2019-02-06 17:50:25,785 INFO [master/**************:16000:becomeActiveMaster] assignment.AssignmentManager: Attach pid=32700, ppid=32695, state=WAITING:*REGION_STATE_TRANSITION_CONFIRM_OPENED*, hasLock=false; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN to rit=*OPENING*, location=*<server1>,16020,1549450371876*, table=hbase:meta, region=1588230740 to restore RIT
2019-02-06 17:50:25,867 INFO [master/****************:16000:becomeActiveMaster] master.ServerManager: Registering regionserver=*<server1>,16020,1549450371876*
2019-02-06 17:50:26,168 INFO [master/**************:16000:becomeActiveMaster] master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1549451497525, server=*<server1>,16020,1549450371876*}
2019-02-06 17:50:30,760 WARN [master/************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPENING, ts=1549451497525, server=*<server1>,16020,1549450371876*}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.
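
For reference, the condition from the issue description quoted below (ZK reports meta as assigned, the hosting server is dead, and no SCP is active) can be sketched roughly as follows. This is only a minimal, self-contained illustration under stated assumptions: the class, enum, method, and server names are all hypothetical, none of them are HBase APIs, and this is not the attached patch. Treating OPENING the same as OPEN is also an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these types or methods are HBase APIs.
class MetaStateCheck {

    // Simplified stand-in for the region states seen in the log lines.
    enum State { OFFLINE, OPENING, OPEN, CLOSED }

    /**
     * Returns true when the meta location recorded in ZK cannot be trusted:
     * ZK says meta is OPEN (or, by assumption, OPENING) on a server that is
     * not in the live set, and no ServerCrashProcedure is active to recover it.
     */
    static boolean needsNewScp(State zkState, String zkServer,
                               Set<String> liveServers, boolean scpActive) {
        boolean claimsAssigned = zkState == State.OPEN || zkState == State.OPENING;
        boolean serverDead = !liveServers.contains(zkServer);
        return claimsAssigned && serverDead && !scpActive;
    }

    public static void main(String[] args) {
        // Made-up server names in the "host,port,startcode" shape used in the logs.
        Set<String> live = new HashSet<>(Arrays.asList("rs2,16020,1547780000000"));
        String deadServer = "rs1,16020,1547776821322";

        // Stale OPEN state on a dead server with no SCP: a new SCP is needed.
        System.out.println(needsNewScp(State.OPEN, deadServer, live, false));

        // Same state, but an SCP is already running: nothing extra to do.
        System.out.println(needsNewScp(State.OPEN, deadServer, live, true));

        // The recorded server is live: the ZK state is plausible as-is.
        System.out.println(needsNewScp(State.OPEN, "rs2,16020,1547780000000", live, false));
    }
}
```

In a check like this, the master would schedule a new SCP for the recorded server only when the method returns true, instead of waiting indefinitely for meta to come online.
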
> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>   Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight chance
> of the master getting into a state where ZK says meta is OPEN, but the server
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted
> and the procWALs were corrupted). In this case waitForMetaOnline never
> returns.
>
> We've seen this happen a few times after a temporary HDFS
> outage. The following log line shows this state.
>
> 2019-01-17 18:55:48,497 WARN [master/************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 *state=*OPEN**, ts=1547780128227, server=*************,16020,1547776821322}; *ServerCrashProcedures=false*. Master startup cannot progress, in holding-pattern until region onlined.
>
> I'm still investigating why this happens and how to prevent getting into this bad state,
> but regardless, the master should be able to recover during a restart by
> initiating a new SCP to fix the meta.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)