[
https://issues.apache.org/jira/browse/HBASE-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048243#comment-18048243
]
Duo Zhang commented on HBASE-29797:
-----------------------------------
This could be a very critical problem, as it means our fencing mechanism loses
its efficacy...
{noformat}
zhangduo@meta02:/data/log/hadoop/nn$ grep
"/hbase/WALs/data03,16020,1766977086762" hadoop-zhangduo-namenode-meta02.log*
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 10:58:11,491 INFO
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1081620498_7887584,
replicas=192.168.0.63:9866, 192.168.0.62:9866, 192.168.0.64:9866 for
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977091448
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:02,791 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.
Holder: DFSClient_NONMAPREDUCE_836297033_1, pending creates: 3],
src=/hbase/WALs/data03,16020,1766977086762-splitting/data03%2C16020%2C1766977086762.1766977091448
from client DFSClient_NONMAPREDUCE_836297033_1
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:02,791 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
Holder: DFSClient_NONMAPREDUCE_836297033_1, pending creates: 3],
src=/hbase/WALs/data03,16020,1766977086762-splitting/data03%2C16020%2C1766977086762.1766977091448
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:02,791 WARN
org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File
/hbase/WALs/data03,16020,1766977086762-splitting/data03%2C16020%2C1766977086762.1766977091448
has not been closed. Lease recovery is in progress. RecoveryId = 7887965 for
block blk_1081620498_7887584
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:04,915 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(oldBlock=BP-563920340-192.168.0.52-1744355183033:blk_1081620498_7887584,
file=/hbase/WALs/data03,16020,1766977086762-splitting/data03%2C16020%2C1766977086762.1766977091448,
newgenerationstamp=7887965, newlength=132260, newtargets=[192.168.0.63:9866,
192.168.0.62:9866, 192.168.0.64:9866]) successful
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:28,406 INFO
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1081620892_7887980,
replicas=192.168.0.63:9866, 192.168.0.61:9866, 192.168.0.65:9866 for
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977408365
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:28,436 INFO
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1081620893_7887981,
replicas=192.168.0.63:9866, 192.168.0.61:9866, 192.168.0.65:9866 for
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.meta.1766977408375.meta
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:28,446 INFO
org.apache.hadoop.ipc.Server: IPC Server handler 7 on default port 8020, call
Call#9880 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease
from 192.168.0.63:45406: java.io.FileNotFoundException: File does not exist:
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977091448
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:28,451 INFO
org.apache.hadoop.ipc.Server: IPC Server handler 9 on default port 8020, call
Call#9881 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease
from 192.168.0.63:45406: java.io.FileNotFoundException: File does not exist:
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977091448
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:28,464 INFO
org.apache.hadoop.ipc.Server: IPC Server handler 6 on default port 8020, call
Call#9886 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes from
192.168.0.63:45406: java.io.FileNotFoundException: File/Directory
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977091448
does not exist.
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:29,142 INFO
org.apache.hadoop.hdfs.StateChange: DIR* completeFile:
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.meta.1766977408375.meta
is closed by DFSClient_NONMAPREDUCE_836297033_1
hadoop-zhangduo-namenode-meta02.log.3:2025-12-29 11:03:29,162 INFO
org.apache.hadoop.hdfs.StateChange: DIR* completeFile:
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977408365
is closed by DFSClient_NONMAPREDUCE_836297033_1
zhangduo@meta02:/data/log/hadoop/nn$ grep
"data03%2C16020%2C1766977086762.1766977408365"
hadoop-zhangduo-namenode-meta02.log.3
2025-12-29 11:03:28,406 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocate blk_1081620892_7887980, replicas=192.168.0.63:9866, 192.168.0.61:9866,
192.168.0.65:9866 for
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977408365
2025-12-29 11:03:29,162 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile:
/hbase/WALs/data03,16020,1766977086762/data03%2C16020%2C1766977086762.1766977408365
is closed by DFSClient_NONMAPREDUCE_836297033_1
{noformat}
Obviously, we had renamed the directory from data03,16020,1766977086762 to
data03,16020,1766977086762-splitting, and later, when data03 came back, it did
get some 'File does not exist' errors, but it could still allocate new blocks
under the data03,16020,1766977086762 directory and finally close the WAL files
successfully...
[~weichiu] Do you have any idea about this HDFS behavior? How could this
happen?
Thanks.
Let me see if I can reproduce it locally...
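For the record, here is roughly what I plan to try (only a sketch: the NameNode
URI, paths and file names are made up, and it is my assumption that the WAL
writer uses a non-recursive create, so the sketch does too). One client keeps a
WAL-like file open under the live directory, a second client renames the
directory to -splitting the way the master does, and then the first client
tries to roll a new file under the old path and to write/close the stream it
already had open.
{noformat}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Two clients against a throwaway HDFS: one plays the region server that keeps
// a WAL open, the other plays the master that fences it by renaming the dir.
public class WalFencingRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://localhost:8020"); // made-up test NameNode

    // "Old RS" client: create a WAL-like file under the live dir and keep it open.
    FileSystem rs = FileSystem.newInstance(uri, conf);
    Path walDir = new Path("/repro/WALs/rs1");
    Path oldWal = new Path(walDir, "wal.1");
    FSDataOutputStream out = rs.create(oldWal, true);
    out.write(new byte[1024]);
    out.hflush();

    // "Master" client: fence by renaming the directory to -splitting.
    FileSystem master = FileSystem.newInstance(uri, conf);
    Path splitting = new Path("/repro/WALs/rs1-splitting");
    System.out.println("fencing rename: " + master.rename(walDir, splitting));

    // Back on the "old RS": roll to a new WAL under the *old* path. Assumption:
    // the real WAL writer does not recreate the parent directory, so use
    // createNonRecursive; a plain create() would just recreate the dir.
    Path newWal = new Path(walDir, "wal.2");
    try (FSDataOutputStream out2 =
        rs.createNonRecursive(newWal, true, 4096, (short) 3, 128L * 1024 * 1024, null)) {
      out2.write(new byte[1024]);
      System.out.println("new WAL created under the renamed-away directory");
    } catch (Exception e) {
      System.out.println("new WAL create failed (fencing held): " + e);
    }

    // Can the stream that was open before the rename still be written and closed?
    try {
      out.write(new byte[1024]);
      out.close();
      System.out.println("pre-rename WAL stream written and closed after the rename");
    } catch (Exception e) {
      System.out.println("pre-rename WAL stream failed after the rename: " + e);
    }

    rs.close();
    master.close();
  }
}
{noformat}
If both the non-recursive create and the close of the pre-rename stream still
succeed under the old path, that would line up with the block allocations and
completeFile calls in the NameNode log above.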
> RegionServer aborted because of invalid max sequence id
> -------------------------------------------------------
>
> Key: HBASE-29797
> URL: https://issues.apache.org/jira/browse/HBASE-29797
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Duo Zhang
> Priority: Blocker
>
> {noformat}
> 2025-12-29T11:03:32,429 WARN [RS_CLOSE_REGION-regionserver/data02:16020-0] handler.UnassignRegionHandler: Fatal error occurred while closing region 8d60369be1061570a2f6e47a1af7a797, aborting...
> java.io.IOException: The new max sequence id 1212630 is less than the old max sequence id 1212631
>         at org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:402)
>         at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1290)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1950)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1675)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1630)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1613)
>         at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:139)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:840)
> 2025-12-29T11:03:32,433 ERROR [RS_CLOSE_REGION-regionserver/data02:16020-0] regionserver.HRegionServer: ***** ABORTING region server data02,16020,1766977119966: Failed to close region 8d60369be1061570a2f6e47a1af7a797 and can not recover *****
> java.io.IOException: The new max sequence id 1212630 is less than the old max sequence id 1212631
>         at org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:402)
>         at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1290)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1950)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1675)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1630)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1613)
>         at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:139)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:840)
> {noformat}