[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.
[ https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510172#comment-17510172 ]

Huaxiang Sun commented on HBASE-26864:
--------------------------------------

Thanks for the explanation. I had assumed that a report is associated with a procId, and that the master would discard a report when there is no outstanding procedure.

For this specific case, there is a bug in the rollback handling in SplitTableRegionProcedure; I am preparing a patch.

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]
[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]

{code:java}
// In the state machine:
case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
  addChildProcedure(createUnassignProcedures(env));
  // Comments from HX:
  // createUnassignProcedures() can throw an IOException. If this happens,
  // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and no parent
  // region is closed, as all created UnassignProcedures are rolled back. If it
  // rolls back with state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, there is no need
  // to call openParentRegion(); otherwise, it will result in an OpenRegionProcedure
  // for an already open region.
  setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
  break;

// In the rollback:
case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
  // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
  // we will bring parent region online
  break;
case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
  // Comments from HX:
  // openParentRegion() should not be called here, as explained above.
  openParentRegion(env);
  break;
{code}

> Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.
> ---------------------------------------------------------------
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.4.10
> Reporter: Huaxiang Sun
> Assignee: Huaxiang Sun
> Priority: Major
>
> For some upgrading cases, we found that master issues RegionOpen for an already open region and Region Server simply logs
> {code:java}
> 2022-03-17 22:16:55,595 WARN org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received OPEN for foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4. which is already online
> {code}
> and it does not ack or nack master. This OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack master that region is open.
>
> For the cause of why it sent an OpenRegion request for an already open region, it will be followed by another issue.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
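The rollback problem described in the SplitTableRegionProcedure comment above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the HBase code and not the patch mentioned in the comment: it assumes rollback visits every entered state in reverse order, and `SplitRollbackSketch`, its two-value `State` enum, and the `"openParentRegion"` action string are hypothetical names made up for illustration. In this model the parent is reopened only when the procedure actually reached the check-closed state, so a failure while creating the unassign procedures produces no open request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone model; none of these names are HBase APIs.
final class SplitRollbackSketch {
  // Shortened stand-ins for the two states discussed in the comment.
  enum State { CLOSE_PARENT_REGION, CHECK_CLOSED_REGIONS }

  /**
   * Returns the actions a rollback would take, given the states that were
   * entered in forward order. Assumption: rollback walks them in reverse.
   */
  static List<String> rollback(List<State> entered) {
    List<String> actions = new ArrayList<>();
    for (int i = entered.size() - 1; i >= 0; i--) {
      switch (entered.get(i)) {
        case CHECK_CLOSED_REGIONS:
          // This state is only reached after the unassign step went through,
          // so the parent may be closed and has to be brought back online.
          actions.add("openParentRegion");
          break;
        case CLOSE_PARENT_REGION:
          // If the failure happened here, the parent was never closed,
          // so there is nothing to reopen.
          break;
      }
    }
    return actions;
  }
}
```

With only `CLOSE_PARENT_REGION` entered, the sketch returns no actions; with both states entered, it returns a single `openParentRegion` action.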
[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.
[ https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510167#comment-17510167 ]

Duo Zhang commented on HBASE-26864:
-----------------------------------

{quote}
Can you elaborate more about the double assign issue?
{quote}
The retry of sending the open region request is done on the master side and can happen at any time. It is possible that the open region request has already finished on the RS side when the retry arrives, and then we start to report back again. If the region is reassigned to another RS before that report is processed correctly, a double assign happens.

What you described here is another corner case. I think we need to check the code in SplitTableRegionProcedure. Maybe it does not work well when region replicas are enabled, or the rollback logic needs polish.

Thanks.
[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.
[ https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510022#comment-17510022 ]

Huaxiang Sun commented on HBASE-26864:
--------------------------------------

[~zhangduo], can you elaborate more on the double assign issue? I will provide more details about the root cause later today. As far as I can tell from the log, it is not caused by the cases you described.

The sequence I found is that the region is opened at RS A, and A acks back to master that the region is opened. During postOpenDeployTasks, RS A finds that it needs to split the region, so it sends a split request to master. Master starts the RegionSplitProcedure and later finds that a replica parent is still being opened. It rolls back the RegionSplitProcedure and, in the process, sends an OpenRegion request to RS A.

Even if RS A does not ack in this case, it still needs to clean up some state: the proc id is in submittedRegionProcedures and needs to be removed.
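The cleanup point above (dropping the proc id when an OPEN arrives for a region that is already online) might look roughly like the following. This is a sketch only: `DuplicateOpenSketch` and all of its members are hypothetical, the two sets are simple stand-ins for the region server's real bookkeeping (including whatever structure backs submittedRegionProcedures), and no reporting to the master is modelled.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for region server bookkeeping; not HBase code.
final class DuplicateOpenSketch {
  private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
  private final Set<Long> submittedProcIds = ConcurrentHashMap.newKeySet();

  /** Remembers a procedure id when its request is accepted. */
  boolean submit(long procId) {
    return submittedProcIds.add(procId);
  }

  /** Returns true if the region was newly opened by this request. */
  boolean handleOpen(long procId, String regionName) {
    if (onlineRegions.contains(regionName)) {
      // Region is already online: skip the open, but still forget the
      // procedure id so it does not linger in the submitted set.
      submittedProcIds.remove(procId);
      return false;
    }
    onlineRegions.add(regionName);
    // Simplification: a real server would report to the master before
    // clearing the id; that step is left out of this sketch.
    submittedProcIds.remove(procId);
    return true;
  }

  boolean isTracked(long procId) {
    return submittedProcIds.contains(procId);
  }
}
```

Whether the region server should additionally ack the master in the already-online case is the open question of this thread; the sketch deliberately leaves that out.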
[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.
[ https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509134#comment-17509134 ]

Duo Zhang commented on HBASE-26864:
-----------------------------------

Sending an ack back may cause a double assign issue.

At the end of AssignRegionHandler, we call HRegionServer.postOpenDeployTasks, and in this method we call HRegionServer.reportRegionStateTransition. In HRegionServer.reportRegionStateTransition, there is a loop to make sure that we successfully send the ack back.

So I think we still need to find the root cause here: either we miss the call to postOpenDeployTasks in some code path, or there is a race on the master side?
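For readers unfamiliar with the "loop to make sure that we successfully send the ack back", the general shape of such a retry loop is sketched below. It does not reproduce HRegionServer.reportRegionStateTransition: `ReportRetrySketch`, `reportUntilAcked`, and the `maxAttempts` bound are hypothetical and added only so the example terminates, and the real method's pause and error handling are not modelled.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of a retry-until-acked loop; not HBase code.
final class ReportRetrySketch {
  /**
   * Calls sendReport until it returns true or maxAttempts is exhausted.
   * Returns the attempt number that succeeded, or -1 if none did.
   */
  static int reportUntilAcked(BooleanSupplier sendReport, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (sendReport.getAsBoolean()) {
        return attempt; // the master accepted the report
      }
      // A real implementation would typically pause before retrying.
    }
    return -1;
  }
}
```

The relevance to this issue is that such a loop only helps on code paths that reach it; the already-online branch in AssignRegionHandler logs a warning and, per the report, never reports back.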