[ https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-28420.
-------------------------------
Fix Version/s: 2.7.0
3.0.0-beta-2
2.6.1
2.5.9
Hadoop Flags: Reviewed
Resolution: Fixed
Pushed to all active branches.
Thanks [~umesh9414] for contributing!
> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
> Key: HBASE-28420
> URL: https://issues.apache.org/jira/browse/HBASE-28420
> Project: HBase
> Issue Type: Bug
> Components: master, proc-v2
> Affects Versions: 2.4.17, 2.5.8
> Reporter: Umesh Kumar Kumawat
> Assignee: Umesh Kumar Kumawat
> Priority: Critical
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> When the active HMaster is in the process of aborting and another HMaster is
> becoming the active HMaster, if a region server reports the completion of a
> remote procedure at the same time, the report generally goes to the old
> active HMaster because of the cached value of rssStub ->
> [code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
> ([caller
> method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
> On the master side
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
> the handler does check whether the service has started, but that check still
> returns true while the master is aborting (I did not see this flag being set
> to false during an abort).
> This issue becomes *critical* when a *ServerCrash of the meta-hosting RS and
> a master failover* happen at the same time: hbase:meta can get stuck in the
> OFFLINE state.
> Logs of the HMaster starting to abort:
> {noformat}
> 2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING
> master server4-1xxx,61000,1705169084562:
> FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING
> *****{noformat}
> {noformat}
> 2024-02-02 07:33:40,999 INFO [master/server4-1xxx:61000]
> regionserver.HRegionServer - Exiting;
> stopping=hbase2b-mnds4-1-ia2.ops.sfdc.net,61000,1705169084562; zookeeper
> connection closed.{noformat}
> It took almost 30 seconds to abort the HMaster.
>
> Logs of starting the SCP for the meta-carrying host (this SCP is started by
> the new active HMaster):
> {noformat}
> 2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster]
> assignment.AssignmentManager - Scheduled
> ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955
> (carryingMeta=true) server5-1-
> xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write
>
> locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
> Initialization of the remote procedure:
> {noformat}
> 2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=3305548,
> ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-
> xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta,
> worker=server4-1-xxxx,61020,1705169180881}]{noformat}
> Logs of remote procedure handling on the old active HMaster
> (server4-1xxx,61000), which is in the process of aborting:
> {noformat}
> 2024-02-02 07:33:37,990 DEBUG
> [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote
> procedure
> done, pid=3305548{noformat}
> This report should have been handled by the new active HMaster so that it
> could wake up the suspended procedure there. Because the new active HMaster
> was never able to wake it up, the SCP got stuck and meta stayed OFFLINE.
>
> Logs of the HMaster trying to become the active HMaster but getting stuck:
> {noformat}
> 2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster]
> master.HMaster - hbase:meta,,1.1588230740
> is NOT online; state={1588230740 state=OPEN, ts=1706859212481,
> server=server5-1-xxx,61020,1706857451955};
> ServerCrashProcedures=true. Master startup cannot progress, in
> holding-pattern until region onlined.{noformat}
> After this, the master stayed stuck until we performed an HMaster failover to
> get out of this situation.
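
The fix direction described in the report can be sketched as follows. This is an illustrative sketch only, not the actual HBASE-28420 patch: all class and method names below are hypothetical. The idea is that the master's RPC handler for "remote procedure done" should reject reports not only before the service has started, but also once the master begins stopping or aborting, so the region server fails over and retries the report against the new active HMaster.

```java
// Hypothetical sketch of the guard described in the issue; names are
// illustrative and do not correspond to the actual HBase patch.
public class ProcedureReportGuard {

    /** Mimics the master's lifecycle flags. */
    static class Master {
        volatile boolean serviceStarted = true;
        volatile boolean stopping = false; // set when an abort begins
    }

    /** Thrown so the region server retries against the new active master. */
    static class ServiceNotRunningException extends Exception {
        ServiceNotRunningException(String msg) { super(msg); }
    }

    /**
     * Guarded report handler: rejects reports both before the service has
     * started and while the master is stopping/aborting.
     */
    static void reportProcedureDone(Master master, long procId)
            throws ServiceNotRunningException {
        if (!master.serviceStarted || master.stopping) {
            throw new ServiceNotRunningException(
                "Master is not running; rejecting report for pid=" + procId);
        }
        // ... otherwise wake up the suspended procedure for procId ...
    }

    public static void main(String[] args) throws Exception {
        Master m = new Master();
        reportProcedureDone(m, 3305548L); // accepted while running normally
        m.stopping = true;                // abort begins
        boolean rejected = false;
        try {
            reportProcedureDone(m, 3305548L);
        } catch (ServiceNotRunningException e) {
            rejected = true;              // report bounced back to the RS
        }
        if (!rejected) throw new AssertionError("report was not rejected");
        System.out.println("ok");
    }
}
```

With such a guard, the old active HMaster in the logs above would have rejected the report for pid=3305548, the region server would have retried against the new active HMaster, and the suspended SplitWALRemoteProcedure could have been woken up there.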
--
This message was sent by Atlassian Jira
(v8.20.10#820010)