[ 
https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-28420.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.1
                   2.5.9
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~umesh9414] for contributing!

> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
>                 Key: HBASE-28420
>                 URL: https://issues.apache.org/jira/browse/HBASE-28420
>             Project: HBase
>          Issue Type: Bug
>          Components: master, proc-v2
>    Affects Versions: 2.4.17, 2.5.8
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> When the active HMaster is aborting and another HMaster is becoming active, 
> a region server that reports completion of a remote procedure at that moment 
> generally sends the report to the old active HMaster, because of the cached 
> value of rssStub -> 
> [code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
>  ([caller 
> method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
>  On the master side 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
>  there is a check that the service is started, but it still returns true 
> while the master is aborting (I did not see where this flag is set to false 
> during an abort).  
> This issue becomes *critical* when a *ServerCrash of the meta-hosting RS and 
> a master failover* happen at the same time: hbase:meta gets stuck in the 
> OFFLINE state.
> Logs for the start of the HMaster abort:
> {noformat}
> 2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING 
> master server4-1xxx,61000,1705169084562:
> FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING 
> *****{noformat}
> {noformat}
> 2024-02-02 07:33:40,999 INFO [master/server4-1xxx:61000] 
> regionserver.HRegionServer - Exiting; 
> stopping=hbase2b-mnds4-1-ia2.ops.sfdc.net,61000,1705169084562; zookeeper 
> connection closed.{noformat}
> It took almost 30 seconds to abort the HMaster.
>  
> Logs of the SCP starting for the host carrying meta (this SCP is started by 
> the new active HMaster):
> {noformat}
> 2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster] 
> assignment.AssignmentManager - Scheduled
> ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955 
> (carryingMeta=true) server5-1-
> xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write
>  
> locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
> Initialization of the remote procedure:
> {noformat}
> 2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=3305548, 
> ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-
> xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta, 
> worker=server4-1-xxxx,61020,1705169180881}]{noformat}
> Logs of the remote procedure report being handled on the old active HMaster 
> (server4-1xxx,61000), which was in the process of aborting:
> {noformat}
> 2024-02-02 07:33:37,990 DEBUG 
> [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote 
> procedure 
> done, pid=3305548{noformat}
> This report should have been handled by the new active HMaster, so that it 
> could wake up the suspended procedure there. Because the new active HMaster 
> never received the report, it could not wake the procedure up, the SCP got 
> stuck, and meta stayed OFFLINE. 
>  
> Logs of the HMaster trying to become active but getting stuck:
> {noformat}
> 2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster] 
> master.HMaster - hbase:meta,,1.1588230740 
> is NOT online; state={1588230740 state=OPEN, ts=1706859212481, 
> server=server5-1-xxx,61020,1706857451955}; 
> ServerCrashProcedures=true. Master startup cannot progress, in 
> holding-pattern until region onlined.{noformat}
> After this, the master stayed stuck until we triggered another HMaster 
> failover to get out of this situation. 
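
The failure mode described above can be illustrated with a small, self-contained Java sketch. It is not the HBase code and not the patch that was committed for this issue: ToyMaster, ToyRegionServer, MasterNotRunningException, and every method name are hypothetical stand-ins. The sketch only assumes what the description states: the region server keeps a cached master stub, and a "service started" check stays true while the master is aborting. Under those assumptions, one way to avoid the lost report is for an aborting master to reject it, so that the region server drops its cached stub and retries against the current active master.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class RemoteProcedureReportSketch {

  /** Hypothetical exception: the contacted master cannot accept reports. */
  static class MasterNotRunningException extends Exception {
    MasterNotRunningException(String msg) { super(msg); }
  }

  /** Toy master that tracks "service started" and "aborting" separately. */
  static class ToyMaster {
    final String name;
    final AtomicBoolean serviceStarted = new AtomicBoolean(true);
    final AtomicBoolean aborting = new AtomicBoolean(false);
    int reportsHandled = 0; // single-threaded sketch, so a plain int is enough

    ToyMaster(String name) { this.name = name; }

    /** Starting an abort does not flip serviceStarted, as in the report. */
    void abort() { aborting.set(true); }

    /** Rejects the report unless the master is started AND not aborting. */
    void reportProcedureDone(long pid) throws MasterNotRunningException {
      if (!serviceStarted.get() || aborting.get()) {
        throw new MasterNotRunningException(name + " rejects pid=" + pid);
      }
      reportsHandled++;
    }
  }

  /** Toy region server holding a cached master stub (like rssStub). */
  static class ToyRegionServer {
    final AtomicReference<ToyMaster> cachedStub;
    final Supplier<ToyMaster> activeMasterLookup; // hypothetical lookup

    ToyRegionServer(ToyMaster initial, Supplier<ToyMaster> lookup) {
      this.cachedStub = new AtomicReference<>(initial);
      this.activeMasterLookup = lookup;
    }

    /** Reports completion; returns the name of the master that accepted. */
    String report(long pid, int maxAttempts) {
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        ToyMaster master = cachedStub.get();
        if (master == null) {
          master = activeMasterLookup.get();
          cachedStub.set(master);
        }
        try {
          master.reportProcedureDone(pid);
          return master.name;
        } catch (MasterNotRunningException e) {
          // Drop the stale stub so the next attempt looks up the active master.
          cachedStub.compareAndSet(master, null);
        }
      }
      throw new IllegalStateException("no master accepted pid=" + pid);
    }
  }

  public static void main(String[] args) {
    ToyMaster oldMaster = new ToyMaster("old-master");
    ToyMaster newMaster = new ToyMaster("new-master");
    ToyRegionServer rs = new ToyRegionServer(oldMaster, () -> newMaster);

    oldMaster.abort(); // old master is aborting, but the RS still caches it
    System.out.println("handled by " + rs.report(3305548L, 3));
  }
}
```

If the `aborting.get()` check is removed from the sketch, the old master accepts the report and the new master never sees it, which mirrors the stuck SCP in the logs above.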



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
