[ https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-28420.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.1
                   2.5.9
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches. Thanks [~umesh9414] for contributing!

> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
>                 Key: HBASE-28420
>                 URL: https://issues.apache.org/jira/browse/HBASE-28420
>             Project: HBase
>          Issue Type: Bug
>          Components: master, proc-v2
>    Affects Versions: 2.4.17, 2.5.8
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> While the active HMaster is aborting and another HMaster is becoming active, a region server that reports the completion of a remote procedure at the same time will generally send that report to the old active HMaster because of the cached rssStub value -> [code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829] ([caller method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]). On the master side ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]), the handler does check whether the service is started, but that check still returns true while the master is aborting (I did not find anywhere that this flag is set to false during abort).
> This issue becomes *critical* when a *ServerCrashProcedure for the meta-hosting RS and a master failover* happen at the same time: hbase:meta gets stuck in the offline state.
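The failure mode above can be sketched as a toy model. This is a minimal illustration only, not HBase's actual API: the class names (MiniMaster, MiniMasterRpc) and methods are hypothetical stand-ins for HMaster and MasterRpcServices, assuming the fix direction is to reject reports once abort has begun so the region server retries against the new active master.

```java
// Hedged sketch of the bug: a "service started" flag that is never
// flipped during abort lets the dying master swallow remote procedure
// completion reports. All names here are simplified stand-ins, not
// the real HBase classes.
final class MiniMaster {
    private volatile boolean serviceStarted = true; // stays true during abort (the bug)
    private volatile boolean aborting = false;

    void abort() { aborting = true; }        // note: serviceStarted is NOT set to false
    boolean isServiceStarted() { return serviceStarted; }
    boolean isAborting() { return aborting; }
}

final class MiniMasterRpc {
    private final MiniMaster master;
    MiniMasterRpc(MiniMaster master) { this.master = master; }

    /** Buggy check: accepts reports even while the master is aborting. */
    boolean acceptReportBuggy() {
        return master.isServiceStarted();
    }

    /** Fixed check: also reject once abort has begun, so the region
     *  server retries against the new active master, which can then
     *  wake the suspended procedure. */
    boolean acceptReportFixed() {
        return master.isServiceStarted() && !master.isAborting();
    }
}

public class RemoteProcedureReportDemo {
    public static void main(String[] args) {
        MiniMaster master = new MiniMaster();
        MiniMasterRpc rpc = new MiniMasterRpc(master);

        master.abort(); // abort begins; failover to the new master is in progress

        System.out.println("buggy accepts=" + rpc.acceptReportBuggy()); // true: report is swallowed
        System.out.println("fixed accepts=" + rpc.acceptReportFixed()); // false: RS will retry
    }
}
```

In the incident below, the buggy path is exactly what happened: the old master accepted "Remote procedure done, pid=3305548" while aborting, so the new master's suspended SCP was never woken.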
>
> Logs marking the start of the HMaster abort:
> {noformat}
> 2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING master server4-1xxx,61000,1705169084562: FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING *****{noformat}
> {noformat}
> 2024-02-02 07:33:40,999 INFO [master/server4-1xxx:61000] regionserver.HRegionServer - Exiting; stopping=hbase2b-mnds4-1-ia2.ops.sfdc.net,61000,1705169084562; zookeeper connection closed.{noformat}
> It took almost 30 seconds to abort the HMaster.
>
> Logs of the SCP starting for the meta-carrying host (this SCP is started by the new active HMaster):
> {noformat}
> 2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster] assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955 (carryingMeta=true) server5-1-xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
> Initialization of the remote procedure:
> {noformat}
> 2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=3305548, ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta, worker=server4-1-xxxx,61020,1705169180881}]{noformat}
> Logs of the remote procedure being handled on the old active HMaster (server4-1xxx,61000), which was in the process of aborting:
> {noformat}
> 2024-02-02 07:33:37,990 DEBUG [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote procedure done, pid=3305548{noformat}
> This report should have been handled by the new active HMaster so that it could wake up the suspended procedure there. Because the new active HMaster never received it, the SCP got stuck and meta stayed OFFLINE.
>
> Logs of the HMaster trying to become the active HMaster but getting stuck:
> {noformat}
> 2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster] master.HMaster - hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPEN, ts=1706859212481, server=server5-1-xxx,61020,1706857451955}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.{noformat}
> After this, the master remained stuck until we performed an HMaster failover to get out of this situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)