[
https://issues.apache.org/jira/browse/HBASE-28690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang reopened HBASE-28690:
-------------------------------
> Aborting Active HMaster is not rejecting reportRegionStateTransition if
> procedure is initialised by next Active master
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-28690
> URL: https://issues.apache.org/jira/browse/HBASE-28690
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.5.8
> Reporter: Umesh Kumar Kumawat
> Assignee: Umesh Kumar Kumawat
> Priority: Major
> Labels: pull-request-available
>
> A CloseRegionProcedure on the master asks the RS to close the region. After
> closing the region, the RS reports the RegionStateTransition back
> ([here|https://github.com/apache/hbase/blob/d1015a68ed9f94d74668abd37edefd32f5e9305b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L1853]).
> On receiving the report, the master checks whether the regionNode has a
> procedure attached to it
> ([code|https://github.com/apache/hbase/blob/d1015a68ed9f94d74668abd37edefd32f5e9305b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1294]).
>
>
> {code:java}
> private boolean reportTransition(RegionStateNode regionNode, ServerStateNode serverNode,
>   TransitionCode state, long seqId, long procId) throws IOException {
>   ServerName serverName = serverNode.getServerName();
>   TransitRegionStateProcedure proc = regionNode.getProcedure();
>   if (proc == null) {
>     return false;
>   }
>
>   proc.reportTransition(master.getMasterProcedureExecutor().getEnvironment(), regionNode,
>     serverName, state, seqId, procId);
>   return true;
> }
> {code}
> If the regionNode has no procedure, the master only logs a warning and does
> not return an error to the RS over RPC.
>
> Consider a master failover in which the new active master has only just
> initialized the TRSP and CloseRegionProcedure. The aborting master now holds
> stale data. If the transition report reaches the aborting master, it is not
> rejected, and the procedure on the new active master gets stuck.
>
> *Logs for more understanding*
> active master server4-1 failing
> {noformat}
> 2024-06-20 04:45:05,576 ERROR
> [iority.RWQ.Fifo.write.handler=3,queue=0,port=61000] master.HMaster - *****
> ABORTING master server4-1,61000,1715413775736: Failed to record region server
> as started *****{noformat}
> *logs of new active master server5-1*
>
> {noformat}
> 2024-06-20 04:49:28,893 DEBUG [aster/server5-1:61000:becomeActiveMaster]
> assignment.RegionStateStore - Load hbase:meta entry
> region=888a715d5926adbb89c985d8967f40d4, regionState=OPEN,
> lastHost=server1-119,61020,1717560166420,
> regionLocation=server1-119,61020,1717560166420, openSeqNum=34892620
> 2024-06-20 04:49:51,886 INFO [PEWorker-22] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=16276416, ppid=16276108,
> state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
> table=RIMBS.UPLOADER_JOB_DETAILS, region=888a715d5926adbb89c985d8967f40d4,
> UNASSIGN}] (on server5-1)
> 2024-06-20 04:49:52,022 INFO [PEWorker-40] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=16276470, ppid=16276416, state=RUNNABLE;
> CloseRegionProcedure 888a715d5926adbb89c985d8967f40d4,
> server=server1-119,61020,1717560166420}] (on server5-1){noformat}
>
> *RS logs for closing*
> {noformat}
> 2024-06-20 04:49:52,267 INFO [_REGION-regionserver/server1-119:61020-2]
> handler.UnassignRegionHandler - Close 888a715d5926adbb89c985d8967f40d4
> 2024-06-20 04:49:52,267 DEBUG [_REGION-regionserver/server1-119:61020-2]
> regionserver.HRegion - Closing 888a715d5926adbb89c985d8967f40d4, disabling
> compactions & flushes
> 2024-06-20 04:49:52,354 INFO [_REGION-regionserver/server1-119:61020-2]
> regionserver.HRegion - Closed
> TABLE,KW\x00na240-app1-16\x00/Events-120620231740\x00MARKER-Events,1702619592612.888a715d5926adbb89c985d8967f40d4.
> {noformat}
> *Logs of the report on the aborting active HMaster*
> {noformat}
> 2024-06-20 04:49:52,355 WARN
> [iority.RWQ.Fifo.write.handler=1,queue=0,port=61000]
> assignment.AssignmentManager - No matching procedure found for
> server1-119,61020,1717560166420 transition on state=OPEN,
> location=server1-119,61020,1717560166420, table=RIMBS.UPLOADER_JOB_DETAILS,
> region=888a715d5926adbb89c985d8967f40d4 to CLOSED ( host = server4-1 ,
> hbaseMasterLogFile){noformat}
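
The failure mode described above can be modelled in a few lines. The following is a minimal sketch, not the patch for this issue and not HBase code: `TransitionReportSketch`, `Master`, `MasterAbortingException`, and the `aborting` flag are all hypothetical names introduced for illustration. It assumes that one plausible remedy is for an aborting master to fail the RPC instead of logging and returning, so that the region server can retry against the new active master that actually owns the procedure.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the master-side report handling; not HBase code.
class TransitionReportSketch {

  // Hypothetical exception: tells the RS that this master cannot accept the report.
  static class MasterAbortingException extends IOException {
    MasterAbortingException(String message) {
      super(message);
    }
  }

  // Minimal stand-in for a master: an abort flag plus region -> attached procedure id.
  static class Master {
    private final String name;
    private final Map<String, Long> procedureByRegion = new ConcurrentHashMap<>();
    private volatile boolean aborting;

    Master(String name) {
      this.name = name;
    }

    void abort() {
      aborting = true;
    }

    void attachProcedure(String region, long procId) {
      procedureByRegion.put(region, procId);
    }

    // Returns true if an attached procedure took the report, false if none was attached.
    boolean reportTransition(String region) throws IOException {
      if (aborting) {
        // Assumed guard: fail the RPC so the RS can retry against the new active master.
        throw new MasterAbortingException(name + " is aborting, rejecting report for " + region);
      }
      // Same shape as the quoted method: no attached procedure means "not handled".
      return procedureByRegion.containsKey(region);
    }
  }

  // Replays the scenario from the logs; true if the stale master rejected the report
  // and the retry against the new active master found the attached procedure.
  static boolean staleMasterRejectsReport() {
    String region = "888a715d5926adbb89c985d8967f40d4";
    Master oldMaster = new Master("server4-1");
    Master newMaster = new Master("server5-1");
    oldMaster.abort();
    newMaster.attachProcedure(region, 16276470L);
    try {
      oldMaster.reportTransition(region);
      return false; // the stale master accepted the report silently
    } catch (IOException rejected) {
      try {
        return newMaster.reportTransition(region); // retry on the new active master
      } catch (IOException unexpected) {
        return false;
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("stale master rejected report: " + staleMasterRejectsReport());
  }
}
```

In this model the rejection is what lets the report reach the master holding the CloseRegionProcedure; how the real region server retries and how the real master detects that it is aborting are outside the sketch.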