[
https://issues.apache.org/jira/browse/HBASE-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048234#comment-18048234
]
Duo Zhang commented on HBASE-29797:
-----------------------------------
This is the log at master side, so at 11:00:32, we opened the region on data03
{noformat}
2025-12-29T11:00:32,255 INFO [PEWorker-5] assignment.RegionStateStore:
pid=158933 updating hbase:meta row=aea23dbdd1f955e04dab89e5327ae553,
regionState=OPEN, openSeqNum=1212222, regionLocation=data03,16020,1766977086762
2025-12-29T11:00:32,261 INFO [PEWorker-7] procedure2.ProcedureExecutor:
Finished pid=159187, ppid=158933, state=SUCCESS, hasLock=false;
OpenRegionProcedure aea23dbdd1f955e04dab89e5327ae553,
server=data03,16020,1766977086762 in 93 msec
2025-12-29T11:00:32,262 INFO [PEWorker-6] procedure2.ProcedureExecutor:
Finished pid=158933, ppid=158696, state=SUCCESS, hasLock=false;
TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
region=aea23dbdd1f955e04dab89e5327ae553, REOPEN/MOVE in 401 msec
{noformat}
This can be connected with the NN log at 11:00:32, so
DFSClient_NONMAPREDUCE_836297033_1 should be data03.
But later, when we wanted to balance the region from data03 to data02, data03
was crashed, and finally we assign the region to data01, but after the SCP for
data03, we still received a successful on the CloseRegionProcedure of data03...
So there could be some race that even after data01 had started to take
assignment, data03 is still doing its closing work...
{noformat}
2025-12-29T11:02:55,044 INFO [master/meta02:16000.Chore.1] master.HMaster:
balance hri=aea23dbdd1f955e04dab89e5327ae553,
source=data03,16020,1766977086762, destination=data02,16020,1766977119966
2025-12-29T11:02:55,044 INFO [PEWorker-7] procedure.MasterProcedureScheduler:
Took xlock for pid=159704, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE,
hasLock=false; TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
region=aea23dbdd1f955e04dab89e5327ae553, REOPEN/MOVE
2025-12-29T11:02:55,045 INFO [PEWorker-7] assignment.RegionStateStore:
pid=159704 updating hbase:meta row=aea23dbdd1f955e04dab89e5327ae553,
regionState=CLOSING, regionLocation=data03,16020,1766977086762
2025-12-29T11:02:55,055 INFO [PEWorker-16] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=159705, ppid=159704, state=RUNNABLE,
hasLock=false; CloseRegionProcedure aea23dbdd1f955e04dab89e5327ae553,
server=data03,16020,1766977086762}]
2025-12-29T11:03:06,891 INFO [PEWorker-4] procedure.ServerCrashProcedure:
pid=159712, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true;
ServerCrashProcedure data03,16020,1766977086762, splitWal=true, meta=true found
RIT pid=159704, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_CLOSED,
hasLock=true; TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
region=aea23dbdd1f955e04dab89e5327ae553, REOPEN/MOVE; state=CLOSING,
location=data03,16020,1766977086762, table=IntegrationTestBigLinkedList,
region=aea23dbdd1f955e04dab89e5327ae553
2025-12-29T11:03:06,892 INFO [PEWorker-15] assignment.RegionStateStore:
pid=159704 updating hbase:meta row=aea23dbdd1f955e04dab89e5327ae553,
regionState=ABNORMALLY_CLOSED
2025-12-29T11:03:06,900 INFO [PEWorker-1] procedure2.ProcedureExecutor:
Finished pid=159705, ppid=159704, state=SUCCESS, hasLock=false;
CloseRegionProcedure aea23dbdd1f955e04dab89e5327ae553,
server=data03,16020,1766977086762 in 11.8430 sec
2025-12-29T11:03:06,901 INFO [PEWorker-1]
assignment.TransitRegionStateProcedure: Starting pid=159704,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
region=aea23dbdd1f955e04dab89e5327ae553, REOPEN/MOVE; state=ABNORMALLY_CLOSED,
location=null; forceNewPlan=true, retain=false
2025-12-29T11:03:07,009 INFO [PEWorker-11] assignment.RegionStateStore:
pid=159704 updating hbase:meta row=aea23dbdd1f955e04dab89e5327ae553,
regionState=OPENING, regionLocation=data01,16020,1766977187536
2025-12-29T11:03:07,027 INFO [PEWorker-3] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=159770, ppid=159704, state=RUNNABLE,
hasLock=false; OpenRegionProcedure aea23dbdd1f955e04dab89e5327ae553,
server=data01,16020,1766977187536}]
2025-12-29T11:03:07,311 INFO [PEWorker-7] assignment.RegionStateStore:
pid=159704 updating hbase:meta row=aea23dbdd1f955e04dab89e5327ae553,
regionState=OPEN, openSeqNum=1212222, regionLocation=data01,16020,1766977187536
2025-12-29T11:03:07,321 INFO [PEWorker-2] procedure2.ProcedureExecutor:
Finished pid=159770, ppid=159704, state=SUCCESS, hasLock=false;
OpenRegionProcedure aea23dbdd1f955e04dab89e5327ae553,
server=data01,16020,1766977187536 in 293 msec
2025-12-29T11:03:07,321 INFO [PEWorker-4] procedure2.ProcedureExecutor:
Finished pid=159704, state=SUCCESS, hasLock=false; TransitRegionStateProcedure
table=IntegrationTestBigLinkedList, region=aea23dbdd1f955e04dab89e5327ae553,
REOPEN/MOVE in 12.2770 sec
{noformat}
> RegionServer aborted because of invalid max sequence id
> -------------------------------------------------------
>
> Key: HBASE-29797
> URL: https://issues.apache.org/jira/browse/HBASE-29797
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Duo Zhang
> Priority: Critical
>
> {noformat}
> 2025-12-29T11:03:32,429 WARN [RS_CLOSE_REGION-regionserver/data02:16020-0]
> handler.UnassignRegionHandler: Fatal error occurred while closing region
> 8d60369be1061570a2f6e47a1af7a797, aborting...
> java.io.IOException: The new max sequence id 1212630 is less than the old max
> sequence id 1212631
> at
> org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:402)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1290)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1950)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1675)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1630)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1613)
> at
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:139)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> 2025-12-29T11:03:32,433 ERROR [RS_CLOSE_REGION-regionserver/data02:16020-0]
> regionserver.HRegionServer: ***** ABORTING region server
> data02,16020,1766977119966: Failed to close region
> 8d60369be1061570a2f6e47a1af7a797 and can not recover *****
> java.io.IOException: The new max sequence id 1212630 is less than the old max
> sequence id 1212631
> at
> org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:402)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1290)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1950)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1675)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1630)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1613)
> at
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:139)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)