[ 
https://issues.apache.org/jira/browse/HBASE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiwen Deng updated HBASE-29364:
--------------------------------
    Description: 
We have encountered multiple instances where regions were opened on 
RegionServers (RS) that had already been offlined. It wasn't until recently 
that we discovered a potential cause for this issue, and the details of the 
problem are as follows:

Our HDFS storage usage reached its limit, which left the HBase Master and 
RegionServers above it unable to write, so they aborted. We eventually 
intervened manually and deleted some data, after which HDFS recovered. The hbck 
report then showed that some regions were recorded as open on already-offline 
RegionServers, so those regions could not serve requests. We finally used HBCK2 
to assign these regions again, and the problem was resolved.
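For reference, re-assigning such a region with HBCK2 looks roughly like the 
following; the jar path and version are placeholders for whatever the 
deployment uses, and the encoded region name is the one analyzed below:
{code:bash}
# Hypothetical invocation: adjust the hbase-hbck2 jar path/version to your deployment.
# 'assigns' takes one or more encoded region names and schedules assign procedures.
hbase hbck -j /path/to/hbase-hbck2-<version>.jar assigns 19f709990ad65ce3d51ddeaf29acf436
{code}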
h3. The Problem

Here is the analysis of the region transition for one specific region: 
19f709990ad65ce3d51ddeaf29acf436:

2025-05-21, 05:48:11: The region was assigned to 
{+}rs-hostname,20700,1747777624803{+}, but due to an anomaly it could not be 
opened on the target RS. The RS reported the failed open back to the Master:
{code:java}
2025-05-21,05:48:11,646 INFO 
[RpcServer.priority.RWQ.Fifo.write.handler=2,queue=0,port=20600] 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase: Received 
report from rs-hostname,20700,1747777624803, transitionCode=FAILED_OPEN, 
seqId=-1, regionNode=state=OPENING, location=rs-hostname,20700,1747777624803, 
table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, proc=pid=78499, 
ppid=78034, state=RUNNABLE; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure {code}
2025-05-21, 05:48:27: rs-hostname,20700,1747777624803 went offline, and its 
ServerCrashProcedure completed shortly afterwards:
{code:java}
2025-05-21, 05:48:27,981 INFO [KeepAlivePEWorker-65] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=78671, 
state=SUCCESS; ServerCrashProcedure server=rs-hostname,20700,1747777624803, 
splitWal=true, meta=false in 16.0720 sec {code}
2025-05-21, 05:49:12: The RS hosting the hbase:meta table also failed, making 
the meta table unavailable, so the above region got stuck in RIT 
(Region-In-Transition):
{code:java}
2025-05-21, 05:49:12,312 WARN [ProcExecTimeout] 
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
Region-In-Transition state=OPENING, location=rs-hostname,20700,1747777624803, 
table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 {code}
Due to the HDFS failure, the Master aborted as well. The new active Master then 
resumed the previously unfinished procedures (ProcedureV2):
{code:java}
2025-05-21, 06:02:38,423 INFO [master/master-hostname:20600:becomeActiveMaster] 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=78034, ppid=77973, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
TransitRegionStateProcedure table=test:xxx, 
region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN

2025-05-21, 06:02:38,572 INFO [master/master-hostname:becomeActiveMaster] 
org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=78034, 
ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
TransitRegionStateProcedure table=test:xxx, 
region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN to state=OFFLINE, 
location=null, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 to 
restore RIT {code}
During the failover, the new Master updates the region's state in the meta 
table based on the procedure's persisted state:
{code:java}
2025-05-21, 06:07:52,433 INFO [master/master-hostname:20600:becomeActiveMaster] 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta 
entry region=19f709990ad65ce3d51ddeaf29acf436, regionState=OPENING, 
lastHost=rs-hostname-last,20700,1747776391310, 
regionLocation=rs-hostname,20700,1747777624803, openSeqNum=174628702

2025-05-21, 06:07:52,433 WARN [master/master-hostname:20600:becomeActiveMaster] 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure: Received report 
OPENED transition from rs-hostname,20700,1747777624803 for state=OPENING, 
location=rs-hostname,20700,1747777624803, table=test:xxx, 
region=19f709990ad65ce3d51ddeaf29acf436, pid=78499 but the new openSeqNum -1 is 
less than the current one 174628702, ignoring...
 {code}
I reviewed the relevant code and found that at this point the region's state in 
the Master's memory is changed to OPENED, and when RegionRemoteProcedureBase 
advances its own state, that OPENED state is persisted to the meta table:
{code:java}
// RegionRemoteProcedureBase#stateLoaded: invoked on the new active Master
// while restoring this procedure's state after failover.
void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
  if (state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED) {
    try {
      restoreSucceedState(am, regionNode, seqId);
    } catch (IOException e) {
      // should not happen as we are just restoring the state
      throw new AssertionError(e);
    }
  }
}

// OpenRegionProcedure#restoreSucceedState: marks the region OPENED in the
// Master's in-memory state; the meta update happens later (see the pid=78034
// update below).
@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
  long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // should have already been persisted, ignore
    return;
  }
  regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
}{code}
Therefore, a failed open was persisted to the meta table as OPEN, and because 
the SCP for that RS had already completed, the region was never reassigned:
{code:java}
2025-05-21, 06:07:53,138 INFO [PEWorker-56] 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=78034 updating 
hbase:meta row=19f709990ad65ce3d51ddeaf29acf436, regionState=OPEN, 
repBarrier=174628702, openSeqNum=174628702, 
regionLocation=rs-hostname,20700,1747777624803 {code}
h3. How to fix
We can follow the same logic as the case where the Master does not fail over: 
if the open fails, the region's state is not modified, which prevents the above 
sequence from occurring.
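As a minimal sketch of that idea (not the actual patch), the restore path could 
skip marking the region OPENED when the stored report was in fact a failed 
open, for example by checking the reported seqId (the FAILED_OPEN report above 
carries seqId=-1, i.e. HConstants.NO_SEQNUM), or by persisting and checking the 
reported transition code:
{code:java}
// Illustrative sketch only: guard stateLoaded() so that a replayed FAILED_OPEN
// report does not restore an OPENED state on the new active Master.
void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
  if (
    state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED
      && seqId != HConstants.NO_SEQNUM // hypothetical guard: ignore failed opens
  ) {
    try {
      restoreSucceedState(am, regionNode, seqId);
    } catch (IOException e) {
      // should not happen as we are just restoring the state
      throw new AssertionError(e);
    }
  }
}{code}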
 

 

 


> Region will be opened in unknown regionserver when master is changed & rs 
> crashed
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-29364
>                 URL: https://issues.apache.org/jira/browse/HBASE-29364
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Zhiwen Deng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
