Umesh Kumar Kumawat created HBASE-28420:
-------------------------------------------
Summary: Aborting Active HMaster is not rejecting remote Procedure
Reports
Key: HBASE-28420
URL: https://issues.apache.org/jira/browse/HBASE-28420
Project: HBase
Issue Type: Bug
Components: master, proc-v2
Affects Versions: 2.5.7
Reporter: Umesh Kumar Kumawat
Assignee: Umesh Kumar Kumawat
If the active HMaster is in the process of aborting while another HMaster is
becoming the active HMaster, and a region server reports the completion of a
remote procedure at the same time, the report generally goes to the old active
HMaster because of the cached value of rssStub ->
[code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
([caller
method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
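The effect of a cached stub can be illustrated with a minimal, self-contained sketch. It is a simplified model, not the HRegionServer code: MasterStub, FakeMaster, Reporter and the queue of lookup answers are all hypothetical names introduced for illustration. The only assumption taken from the description above is that the stub is cached and keeps being used for as long as calls through it succeed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CachedStubSketch {
  /** Hypothetical stand-in for the region server's RPC stub to a master. */
  interface MasterStub {
    void reportProcedureDone(long procId);
  }

  /** Hypothetical master endpoint that counts the reports it accepted. */
  static final class FakeMaster implements MasterStub {
    final String name;
    final boolean rejects;
    int accepted = 0;

    FakeMaster(String name, boolean rejects) {
      this.name = name;
      this.rejects = rejects;
    }

    @Override
    public void reportProcedureDone(long procId) {
      if (rejects) {
        throw new IllegalStateException(name + " rejected report for pid=" + procId);
      }
      accepted++;
    }
  }

  /** Reports through a cached stub and re-resolves the master only after a failed call. */
  static final class Reporter {
    private final Deque<MasterStub> lookupAnswers; // successive "who is active?" answers
    private MasterStub cached;

    Reporter(Deque<MasterStub> lookupAnswers) {
      this.lookupAnswers = lookupAnswers;
    }

    void report(long procId) {
      if (cached == null) {
        cached = lookupAnswers.removeFirst(); // resolve only when nothing is cached
      }
      MasterStub stub = cached;
      try {
        stub.reportProcedureDone(procId);
      } catch (IllegalStateException e) {
        cached = null; // drop the stale stub so the next attempt re-resolves
        throw e;
      }
    }
  }

  public static void main(String[] args) {
    FakeMaster oldMaster = new FakeMaster("old", false); // aborting, but still accepting
    FakeMaster newMaster = new FakeMaster("new", false);
    Deque<MasterStub> answers = new ArrayDeque<>();
    answers.add(oldMaster);
    answers.add(newMaster);

    Reporter reporter = new Reporter(answers);
    reporter.report(1L); // caches the old master
    reporter.report(2L); // failover has happened, but the cached stub is reused
    System.out.println("old=" + oldMaster.accepted + " new=" + newMaster.accepted);
  }
}
```

In this model both reports land on the old master (the program prints old=2 new=0). Only a rejection from the old master would clear the cache and send the next attempt to the new one.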
On the master side
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
the handler does check whether the service has started, but that check still
passes while the master is in the process of aborting, so the report is
accepted.
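The gap in the master-side check can be sketched as follows. This is a hypothetical model and not the MasterRpcServices code or the fix for this issue: MasterModel, its two flags and both handler methods are illustrative names. The first handler mirrors the behaviour described above (only the "started" state is consulted); the second shows one possible stricter guard that also rejects reports while aborting.

```java
class ReportGuardSketch {
  /** Hypothetical lifecycle model of a master; names are illustrative only. */
  static final class MasterModel {
    volatile boolean serviceStarted;
    volatile boolean aborting;
    int completedProcedures = 0;

    /** Behaviour described in this report: only the "started" flag is consulted. */
    void reportDoneStartedCheckOnly(long procId) {
      if (!serviceStarted) {
        throw new IllegalStateException("master not running yet");
      }
      completedProcedures++;
    }

    /** One possible stricter guard (an assumption): also reject while aborting. */
    void reportDoneRejectWhenAborting(long procId) {
      if (!serviceStarted || aborting) {
        throw new IllegalStateException("master not serving, retry against the active master");
      }
      completedProcedures++;
    }
  }

  public static void main(String[] args) {
    MasterModel old = new MasterModel();
    old.serviceStarted = true; // the service started long ago
    old.aborting = true;       // the abort has begun

    old.reportDoneStartedCheckOnly(3305548L);
    System.out.println("accepted while aborting: " + old.completedProcedures);

    try {
      old.reportDoneRejectWhenAborting(3305548L);
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With the stricter guard the region server would receive an error instead of a success, which would give it a reason to drop its cached stub and retry against the new active master.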
This issue becomes *critical* when *a ServerCrash of the meta-hosting RS and a
master failover* happen at the same time and hbase:meta gets stuck in the
offline state.
Logs for the start of the HMaster abort:
{noformat}
2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING
master server4-1xxx,61000,1705169084562: FAILED persisting
region=52d36581218e00a2668776cfea897132 state=CLOSING *****{noformat}
Logs for the start of the SCP for the meta-carrying host:
{noformat}
2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster]
assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=3305546 for
server5-1xxx61020,1706857451955 (carryingMeta=true)
server5-1-xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write
locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
Initialization of the remote procedure:
{noformat}
2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor -
Initialized subprocedures=[{pid=3305548, ppid=3305547, state=RUNNABLE;
SplitWALRemoteProcedure
server5-1-xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta,
worker=server4-1-xxxx,61020,1705169180881}]{noformat}
Logs of remote procedure handling on the old active HMaster
(server4-1xxx,61000), which is in the process of aborting:
{noformat}
2024-02-02 07:33:37,990 DEBUG
[r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote
procedure done, pid=3305548{noformat}
Logs of the HMaster trying to become the active HMaster:
{noformat}
2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster]
master.HMaster - hbase:meta,,1.1588230740 is NOT online; state={1588230740
state=OPEN, ts=1706859212481, server=server5-1-xxx,61020,1706857451955};
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern
until region onlined.{noformat}
After this, the master was stuck for almost 1 hour. We had to do an HMaster
failover to get out of this situation.