[ https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821350#comment-17821350 ]
Aman Poonia commented on HBASE-28405:
-------------------------------------

[~zhangduo] Thanks for the insight.
{noformat}
The logic at RS side is that, at the end of assign a region, it will retry forever on reporting this to master. So if we find out that the region is already online, we should just ignore it, as we can make sure that there is someone else will finally report it to master, to avoid double report and cause issues.
{noformat}
When I checked the RS logs for the OpenRegionProcedure, there were no log entries for that procedure. Similarly, the master logs had nothing about the parent procedure or its progress. So we were stuck in this state indefinitely.

One other thought: we had to execute the TRSP, but maybe not the assign, because the state of the region was MERGING in the region state node. This is the difference: when we check the state of a region we use the region state node, and when we check whether the region is online on the RS we use the online regions map of the RS. So basically we are looking at two different places in the same flow.
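To make the mismatch concrete, here is a toy Java model of the two views. It is only a sketch and not HBase code: `MasterView`, `RegionServerView`, `handleOpen`, `reportOpened` and the reduced `State` enum are names made up for illustration, standing in for the region state node on the master and the online regions map on the RS. It assumes the open handler simply returns when the region is already online, as the WARN message in the issue description suggests.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the two views; none of these types are real HBase classes. */
public class RegionStateMismatchSketch {

    /** Hypothetical subset of master-side region states. */
    enum State { OPEN, OPENING, MERGING }

    /** Stand-in for the master's view (the region state node). */
    static final class MasterView {
        final Map<String, State> stateByRegion = new HashMap<>();
        final Set<String> openReports = new HashSet<>();

        /** Called when a region server reports a completed open. */
        void reportOpened(String region) {
            openReports.add(region);
            stateByRegion.put(region, State.OPEN);
        }
    }

    /** Stand-in for the region server's view (its online regions map). */
    static final class RegionServerView {
        final Set<String> onlineRegions = new HashSet<>();

        /**
         * Models handling of an open request. Returns true if the master was
         * notified, false if the request was ignored because the region is
         * already online on this server.
         */
        boolean handleOpen(String region, MasterView master) {
            if (onlineRegions.contains(region)) {
                // Assumes an earlier open will report to the master.
                return false;
            }
            onlineRegions.add(region);
            master.reportOpened(region);
            return true;
        }
    }

    public static void main(String[] args) {
        String region = "a92008b76ccae47d55c590930b837036";
        MasterView master = new MasterView();
        RegionServerView rs = new RegionServerView();

        // The merge marked the region MERGING, but the close never ran,
        // so the region is still online on the region server.
        master.stateByRegion.put(region, State.MERGING);
        rs.onlineRegions.add(region);

        // Rollback schedules an assign: the master moves the region to
        // OPENING and sends the open request.
        master.stateByRegion.put(region, State.OPENING);
        boolean notified = rs.handleOpen(region, master);

        System.out.println("master notified: " + notified);
        System.out.println("master state: " + master.stateByRegion.get(region));
    }
}
```

In this model the master is never notified and its state stays at OPENING, which matches what we saw: the master side consults one structure, the RS side consults another, and nothing reconciles them.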
Maybe, since the region is online, we could just change the state in the region state node (meta) from MERGING to OPEN.

> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
>                 Key: HBASE-28405
>                 URL: https://issues.apache.org/jira/browse/HBASE-28405
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Aman Poonia
>            Assignee: Aman Poonia
>            Priority: Major
>
> *We had a scenario in production where a merge operation had failed as below*
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31] assignment.MergeTableRegionsProcedure - Error trying to merge [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING, location=rs-229,60020,1707587658182, table=table1, region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
>
> *Now when we roll back the failed merge operation, we see an issue where the region is in state opened until the RS holding it is stopped.*
>
> Rollback creates a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor - Stored [pid=26674602, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1, region=a92008b76ccae47d55c590930b837036, ASSIGN]_
>
> *and rollback finished successfully*
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor - Rolled back pid=26673594, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.HBaseIOException via master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING, location=rs-229,60020,1707587658182, table=table1, region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up; MergeTableRegionsProcedure table=table1, regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b], force=false exec-time=1.4820 sec_
>
> *We create a procedure to open the region a92008b76ccae47d55c590930b837036. Interestingly, we never closed the region, because the exception was thrown while creating the procedures to close the regions, not while executing them. When we run the TRSP, it sends an OpenRegionProcedure, which is handled by AssignRegionHandler. On execution, this handler reports that the region is already online.*
>
> The sequence of events is as follows:
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore - pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036, regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=26675798, ppid=26674602, state=RUNNABLE; OpenRegionProcedure a92008b76ccae47d55c590930b837036, server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10] handler.AssignRegionHandler - Received OPEN for table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already online_

--
This message was sent by Atlassian Jira
(v8.20.10#820010)