[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761187#comment-16761187
 ] 

stack commented on HBASE-21844:
-------------------------------

bq.  But the meta znode was never updated. 

On removal, did a SCP not trigger for the removed server?

Is the 3.0.x an HWX version? It's not hbase master, right? (Just trying to be 
clear.)

bq. As mentioned in HBASE-21797 I think it's quite reasonable to try to make 
AMv2 more resilient. 

I don't disagree. Was thinking we could fix issues as they come up and in this 
way make it more resilient. Was trying to avoid band-aiding over problems.

bq. The common problem can be that proc WAL simply was not flushed due to an 
HDFS issue....

... and we should just re-run procedure steps already run?

You seem to be suggesting we use Procedures most of the time but then just look 
at master state in some circumstances?

I'm not suggesting running HBCK every day. You can't, really. HBCK2 doesn't work 
the way HBCK1 did, presuming it was able to make god(master)-like pronouncements 
on health -- that's the Master's prerogative.

2.2 hbase replaces a bunch of the assign procedures w/ a new base type, FYI.


> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after meta server dies, there is a slight chance 
> of master getting into a state where the ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case the waitForMetaOnline never 
> returns.
>  
> We've seen this happen a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/************:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*************,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  
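
For illustration only, the stuck condition in the description above (ZK says 
meta is OPEN, the hosting server is dead, and no SCP is active for it) can be 
written as a single predicate. This is a minimal sketch, not HBase code: 
MetaState, needs_new_scp, live_servers and scp_active_for are hypothetical 
names, and the real master logic may check more than this.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class MetaState:
    """Hypothetical snapshot of what ZooKeeper records for hbase:meta."""
    state: str             # e.g. "OPEN", as in the log line above
    server: Optional[str]  # server name ZK claims is hosting meta


def needs_new_scp(meta: MetaState,
                  live_servers: Set[str],
                  scp_active_for: Set[str]) -> bool:
    """Return True when meta is marked OPEN on a dead server with no SCP covering it."""
    if meta.state != "OPEN" or meta.server is None:
        return False  # not the stuck state described in this issue
    if meta.server in live_servers:
        return False  # the hosting server is alive, nothing to recover
    # Server is dead: a new SCP is needed only if none is already handling it.
    return meta.server not in scp_active_for
```

In the reported log line the state is OPEN, the server is gone and 
ServerCrashProcedures=false, so a check like this would return True and the 
master could schedule a new SCP instead of waiting indefinitely.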



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
