[ 
https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834600#comment-17834600
 ] 

Viraj Jasani edited comment on HBASE-28405 at 4/7/24 5:21 AM:
--------------------------------------------------------------

I believe this should fix the issue:
{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
index a9ab6f502a..6beb0fcab7 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
@@ -99,9 +99,16 @@ public class AssignRegionHandler extends EventHandler {
     String encodedName = regionInfo.getEncodedName();
     byte[] encodedNameBytes = regionInfo.getEncodedNameAsBytes();
     String regionName = regionInfo.getRegionNameAsString();
-    Region onlineRegion = rs.getRegion(encodedName);
+    HRegion onlineRegion = rs.getRegion(encodedName);
     if (onlineRegion != null) {
       LOG.warn("Received OPEN for {} which is already online", regionName);
+      if (!rs.reportRegionStateTransition(
+        new RegionStateTransitionContext(TransitionCode.OPENED, onlineRegion.getOpenSeqNum(),
+          openProcId, masterSystemTime, onlineRegion.getRegionInfo()))) {
+        throw new IOException(
+          "Failed to report opened region to master: " + onlineRegion.getRegionInfo().getRegionNameAsString());
+      }
+      rs.finishRegionProcedure(openProcId);
       // Just follow the old behavior, do we need to call reportRegionStateTransition? Maybe not?
       // For normal case, it could happen that the rpc call to schedule this handler is succeeded,
       // but before returning to master the connection is broken. And when master tries again, we
{code}
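This would make region assignment an idempotent operation: a repeated OPEN for a region that is already online would report OPENED to the master again instead of returning silently.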
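
To illustrate what idempotent means here, the following is a minimal, self-contained sketch. It is a toy model and not HBase code: FakeMaster, FakeRegionServer, handleOpen and reportOpened are hypothetical names invented for this illustration, and it only assumes that the master needs one report per OPEN request it sends. The actual proposed change is the diff above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an idempotent OPEN handler. All names are hypothetical, not HBase APIs. */
final class IdempotentAssignSketch {

  /** Stand-in for the master: records every OPENED report it receives. */
  static final class FakeMaster {
    final List<String> reports = new ArrayList<>();

    boolean reportOpened(String encodedName, long openSeqNum, long procId) {
      reports.add(encodedName + "@" + openSeqNum + " pid=" + procId);
      return true;
    }
  }

  /** Stand-in for a region server that tracks online regions by encoded name. */
  static final class FakeRegionServer {
    private final Map<String, Long> onlineRegions = new HashMap<>();
    private final FakeMaster master;
    private long nextSeqNum = 1;

    FakeRegionServer(FakeMaster master) {
      this.master = master;
    }

    /** Handles an OPEN request; returns true only if this call actually opened the region. */
    boolean handleOpen(String encodedName, long procId) {
      Long seqNum = onlineRegions.get(encodedName);
      boolean openedNow = false;
      if (seqNum == null) {
        // First request: open the region and remember its open sequence number.
        seqNum = nextSeqNum++;
        onlineRegions.put(encodedName, seqNum);
        openedNow = true;
      }
      // Report in both cases, so a repeated OPEN still notifies the waiting procedure.
      if (!master.reportOpened(encodedName, seqNum, procId)) {
        throw new IllegalStateException("Failed to report opened region to master: " + encodedName);
      }
      return openedNow;
    }

    int onlineCount() {
      return onlineRegions.size();
    }
  }

  public static void main(String[] args) {
    FakeMaster master = new FakeMaster();
    FakeRegionServer rs = new FakeRegionServer(master);
    rs.handleOpen("a92008b76ccae47d55c590930b837036", 1L);
    rs.handleOpen("a92008b76ccae47d55c590930b837036", 2L); // duplicate OPEN
    System.out.println(master.reports);
  }
}
```

In this sketch the second OPEN does not change the region server's state, but the master still receives a report carrying the second procedure id, which is the property the patch is after.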

 

And of course, we need to remove this comment section, since we now know it is 
no longer relevant :)
{code:java}
// Just follow the old behavior, do we need to call reportRegionStateTransition? Maybe not?
// For normal case, it could happen that the rpc call to schedule this handler is succeeded,
// but before returning to master the connection is broken. And when master tries again, we
// have already finished the opening. For this case we do not need to call
// reportRegionStateTransition any more.
{code}



> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
>                 Key: HBASE-28405
>                 URL: https://issues.apache.org/jira/browse/HBASE-28405
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Aman Poonia
>            Assignee: Aman Poonia
>            Priority: Major
>
> *We had a scenario in production where a merge operation had failed as below*
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31] 
> assignment.MergeTableRegionsProcedure - Error trying to merge 
> [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in 
> table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING, 
> location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
> *Now when we roll back the failed merge operation, we see an issue where the 
> region remains in state opened until the RS holding it is stopped.*
> Rollback creates a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor - 
> Stored [pid=26674602, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1, 
> region=a92008b76ccae47d55c590930b837036, ASSIGN]_
> *and rollback finished successfully*
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor - 
> Rolled back pid=26673594, state=ROLLEDBACK, 
> exception=org.apache.hadoop.hbase.HBaseIOException via 
> master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent 
> region state=MERGING, location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up; 
> MergeTableRegionsProcedure table=table1, 
> regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b], 
> force=false exec-time=1.4820 sec_
> *We create a procedure to open the region a92008b76ccae47d55c590930b837036. 
> Interestingly, we didn't close the region, because the exception was thrown 
> while creating the procedure to close the regions, not while executing it. 
> When we run the TRSP, it sends an OpenRegionProcedure, which is handled by 
> AssignRegionHandler. On execution, this handler reports that the region is 
> already online.*
> The sequence of events is as follows
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore - 
> pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036, 
> regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=26675798, ppid=26674602, state=RUNNABLE; 
> OpenRegionProcedure a92008b76ccae47d55c590930b837036, 
> server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10] 
> handler.AssignRegionHandler - Received OPEN for 
> table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already 
> online_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
