[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-21 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510172#comment-17510172
 ] 

Huaxiang Sun commented on HBASE-26864:
--

Thanks for the explanation. I assumed that the report is associated with a
procId, and that the master would discard a report when there is no
outstanding procedure.

For this specific case, there is a bug in the rollback handling in
SplitTableRegionProcedure; I am preparing a patch.

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
{code:java}
// In the state machine:

        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
          addChildProcedure(createUnassignProcedures(env));
          // Comments from HX:
          // createUnassignProcedures() can throw an IOException. If this happens,
          // the procedure never reaches SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and
          // no parent region is closed, because all created UnassignProcedures are
          // rolled back. If it rolls back from state
          // SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, there is no need to call
          // openParentRegion(); otherwise, it results in an OpenRegionProcedure
          // for an already open region.
          setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
          break;

// In the rollback:

        case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
          // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
          // we will bring parent region online
          break;
        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
          // Comments from HX:
          // openParentRegion() should not be called here, as explained above.
          openParentRegion(env);
          break;
{code}
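
The comments above can be read as a guard in the rollback path. The sketch below is a toy model only: it is not the HBase classes and not the patch being prepared. `SplitRollbackSketch`, its `parentClosed` flag, and the recorded action strings are hypothetical names, used to show that rolling back the close-parent step should reopen the parent only if it was actually closed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the split rollback; hypothetical, not the HBase implementation. */
class SplitRollbackSketch {
  /** Actions the toy "master" performed, recorded for inspection. */
  final List<String> actions = new ArrayList<>();

  /** Hypothetical flag: set only when the unassign step really succeeded. */
  private boolean parentClosed = false;

  /**
   * Models the SPLIT_TABLE_REGION_CLOSE_PARENT_REGION step.
   *
   * @param unassignFails simulates createUnassignProcedures() throwing an IOException
   * @return true if the step succeeded and the parent is now closed
   */
  boolean closeParentRegion(boolean unassignFails) {
    if (unassignFails) {
      // Nothing was closed; the state machine never advances past this step.
      return false;
    }
    parentClosed = true;
    actions.add("closeParent");
    return true;
  }

  /** Models the rollback of SPLIT_TABLE_REGION_CLOSE_PARENT_REGION. */
  void rollbackCloseParentRegion() {
    if (parentClosed) {
      // Only reopen a parent that was actually closed.
      actions.add("openParent");
      parentClosed = false;
    }
    // Otherwise do nothing: the parent is still open, so no open request is sent.
  }
}
```

In this model, a failed close followed by a rollback records no actions at all, which corresponds to the master never sending an OpenRegion request for a region that is still online.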

> Region Server does not send Ack back to master after receiving an 
> OpenRegionReq for already opened regions, causing OpenRegionProcedure stay 
> forever.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> In some upgrade cases, we found that the master issues a RegionOpen for an
> already open region and the Region Server simply logs
> {code:java}
> 2022-03-17 22:16:55,595 WARN org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received OPEN for foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4. which is already online
> {code}
> and it does not ack or nack the master. This OpenRegionProcedure is stuck
> forever.
> In this specific case, it needs to ack to the master that the region is open.
>
> The cause of the OpenRegion request for an already open region will be
> followed up in another issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-21 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510167#comment-17510167
 ] 

Duo Zhang commented on HBASE-26864:
---

{quote}
Can you elaborate more about the double assign issue?
{quote}

The retry of sending the open region request happens on the master side and
can happen at any time. It is possible that the open region request has already
finished on the RS side when the retry arrives, and then we start to report
back again. If the region is reassigned to another RS before our report is
processed correctly, a double assign happens.

What you described here is another corner case; I think we need to check the
code in SplitTableRegionProcedure. Maybe it does not work well when region
replicas are enabled, or the rollback logic needs polishing.

Thanks.



[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-21 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510022#comment-17510022
 ] 

Huaxiang Sun commented on HBASE-26864:
--

[~zhangduo], can you elaborate more on the double assign issue? I will
provide more details about the root cause later today. As far as I can tell
from the log, it is not caused by the cases you described. The sequence I found
is that the region is opened at RS A, and A acks back to the master that the
region is opened. During postOpenDeployTasks, RS A finds that it needs to split
the region, so it sends a split request to the master. The master starts the
RegionSplitProcedure and later finds that a replica of the parent is still
being opened. It rolls back the RegionSplitProcedure and, in the process, sends
an OpenRegion request to RS A.

Even if it does not ack in this case, it still needs to clean up some state:
the proc id is in submittedRegionProcedures and needs to be removed.
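
The cleanup point can be sketched as follows. This is a toy model under stated assumptions, not the region server code: `OpenRequestTrackerSketch` and all of its methods are hypothetical, and only the name `submittedRegionProcedures` is taken from the comment above. It shows the proc id being removed on every exit path, including the "already online" path.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of open-request bookkeeping; hypothetical, not the HBase implementation. */
class OpenRequestTrackerSketch {
  /** Proc ids of requests that were accepted but not yet finished. */
  private final Set<Long> submittedRegionProcedures = ConcurrentHashMap.newKeySet();

  /** Names of regions this toy server currently has online. */
  private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();

  /** Records an incoming request; returns false if the proc id is already tracked. */
  boolean submit(long procId) {
    return submittedRegionProcedures.add(procId);
  }

  /**
   * Handles an OPEN request.
   *
   * @return true if the region was opened by this call, false if it was already online
   */
  boolean handleOpen(long procId, String regionName) {
    try {
      // Set.add returns false when the region is already online, so nothing is reopened.
      return onlineRegions.add(regionName);
    } finally {
      // Clean up on every path, including the "already online" one.
      submittedRegionProcedures.remove(procId);
    }
  }

  /** Returns true while the proc id is still tracked as submitted. */
  boolean isTracked(long procId) {
    return submittedRegionProcedures.contains(procId);
  }
}
```

With this shape, a second OPEN for the same region returns false and still leaves no stale proc id behind.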



[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509134#comment-17509134
 ] 

Duo Zhang commented on HBASE-26864:
---

Sending an ack back may cause a double assign issue. At the end of
AssignRegionHandler, we call HRegionServer.postOpenDeployTasks, which in turn
calls HRegionServer.reportRegionStateTransition. And in
HRegionServer.reportRegionStateTransition, there is a loop to make sure that we
successfully send the ack back.

So I think we still need to find the root cause here: either we miss the call
to postOpenDeployTasks in some code path, or there is a race on the master
side.
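
The kind of report loop mentioned above can be illustrated with a minimal sketch. It is not the code of HRegionServer.reportRegionStateTransition: `ReportRetrySketch`, `reportUntilAcked`, and the `maxAttempts` bound are hypothetical, and the bound exists only so that the example terminates.

```java
import java.util.function.BooleanSupplier;

/** Toy retry loop; hypothetical, not the HBase implementation. */
class ReportRetrySketch {
  /**
   * Calls sendReport until it reports success or the attempt budget is used up.
   *
   * @param sendReport returns true when the master accepted the report
   * @param maxAttempts upper bound on attempts (an assumption made for this sketch)
   * @param pauseMillis pause between failed attempts
   * @return the number of attempts used, or -1 if no attempt succeeded
   */
  static int reportUntilAcked(BooleanSupplier sendReport, int maxAttempts, long pauseMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (sendReport.getAsBoolean()) {
        return attempt;
      }
      try {
        Thread.sleep(pauseMillis);
      } catch (InterruptedException e) {
        // Preserve the interrupt status and give up.
        Thread.currentThread().interrupt();
        return -1;
      }
    }
    return -1;
  }
}
```

A caller would pass the actual send operation as `sendReport`; here any `BooleanSupplier` that eventually returns true stands in for a report that the master finally accepts.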
