[ https://issues.apache.org/jira/browse/HBASE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhiwen Deng updated HBASE-29364:
--------------------------------
Description: 
We have encountered multiple instances where regions were opened on RegionServers (RS) that had already been taken offline. Only recently did we discover a likely cause; the details are as follows:

Our HDFS storage usage reached its upper limit, which left the Master and RegionServers on top of it unable to write, so they aborted. After we manually removed some data, HDFS recovered. The hbck report then showed that some regions were marked open on the offline RS, which left those regions unable to serve requests. We finally used hbck2 to assign these regions, and the problem was resolved.

h3. The Problem

Here is the analysis of the region transition for one specific region, 19f709990ad65ce3d51ddeaf29acf436:

2025-05-21, 05:48:11: The region was assigned to {+}rs-hostname,20700,1747777624803{+}, but due to some anomalies it could not be opened on the target RS. The RS reported the open result (FAILED_OPEN) to the Master:
{code:java}
2025-05-21,05:48:11,646 INFO [RpcServer.priority.RWQ.Fifo.write.handler=2,queue=0,port=20600] org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase: Received report from rs-hostname,20700,1747777624803, transitionCode=FAILED_OPEN, seqId=-1, regionNode=state=OPENING, location=rs-hostname,20700,1747777624803, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, proc=pid=78499, ppid=78034, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure {code}
2025-05-21, 05:48:27: rs-hostname,20700,1747777624803 went offline and its ServerCrashProcedure completed.
{code:java}
2025-05-21, 05:48:27,981 INFO [KeepAlivePEWorker-65] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=78671, state=SUCCESS; ServerCrashProcedure server=rs-hostname,20700,1747777624803, splitWal=true, meta=false in 16.0720 sec {code}
2025-05-21, 05:49:12: The RS hosting the hbase:meta table also failed, making the meta table unavailable and leaving the region stuck in RIT (Region-In-Transition).
{code:java}
2025-05-21, 05:49:12,312 WARN [ProcExecTimeout] org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition state=OPENING, location=rs-hostname,20700,1747777624803, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 {code}
Because of the HDFS failure the Master aborted as well, and the new active Master resumed the previously incomplete procedures:
{code:java}
2025-05-21, 06:02:38,423 INFO [master/master-hostname:20600:becomeActiveMaster] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=78034, ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN
2025-05-21, 06:02:38,572 INFO [master/master-hostname:becomeActiveMaster] org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=78034, ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN to state=OFFLINE, location=null, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 to restore RIT {code}
During the Master switch, the new Master rebuilds the region's state in memory (and later rewrites it to the meta table) based on the persisted procedure state:
{code:java}
2025-05-21, 06:07:52,433 INFO [master/master-hostname:20600:becomeActiveMaster] org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta entry region=19f709990ad65ce3d51ddeaf29acf436, regionState=OPENING, lastHost=rs-hostname-last,20700,1747776391310, regionLocation=rs-hostname,20700,1747777624803, openSeqNum=174628702
2025-05-21, 06:07:52,433 WARN [master/master-hostname:20600:becomeActiveMaster] org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure: Received report OPENED transition from rs-hostname,20700,1747777624803 for state=OPENING, location=rs-hostname,20700,1747777624803, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, pid=78499 but the new openSeqNum -1 is less than the current one 174628702, ignoring...
{code}
I reviewed the relevant code and found that at this point the region's state in the Master's memory is changed to OPENED, and along with the state transition of RegionRemoteProcedureBase it is persisted to the meta table:
{code:java}
void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
  if (state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED) {
    try {
      restoreSucceedState(am, regionNode, seqId);
    } catch (IOException e) {
      // should not happen as we are just restoring the state
      throw new AssertionError(e);
    }
  }
}

@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
    long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // should have already been persisted, ignore
    return;
  }
  regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
}{code}
As a result, a failed open was persisted to the meta table as OPEN, and because the SCP for that RS had already completed, the region was never processed again:
{code:java}
2025-05-21, 06:07:53,138 INFO [PEWorker-56] org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=78034 updating hbase:meta row=19f709990ad65ce3d51ddeaf29acf436, regionState=OPEN, repBarrier=174628702, openSeqNum=174628702, regionLocation=rs-hostname,20700,1747777624803 {code}
h3. How to fix

We can follow the logic used when the Master does not switch: if the open fails, the region's state is not modified, which prevents the sequence above from occurring.
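For illustration only, here is a minimal sketch of such a guard in the restoreSucceedState override quoted above. The openSeqNum < 0 check is an assumption keyed off the FAILED_OPEN report carrying seqId=-1, not a confirmed final fix:
{code:java}
@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
    long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // should have already been persisted, ignore
    return;
  }
  // Hypothetical guard: a FAILED_OPEN report is recorded with seqId = -1 (see the
  // log above), so do not restore an OPENED in-memory state for it. The region is
  // then left to the normal retry handling instead of being persisted as OPEN on a
  // server whose SCP has already finished.
  if (openSeqNum < 0) {
    return;
  }
  regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
}{code}
If the procedure also retains the reported transitionCode, checking for FAILED_OPEN there instead of relying on the seqId would work equally well.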
> Region will be opened in unknown regionserver when master is changed & rs crashed
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-29364
>                 URL: https://issues.apache.org/jira/browse/HBASE-29364
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Zhiwen Deng
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)