[
https://issues.apache.org/jira/browse/HBASE-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18090070#comment-18090070
]
Prathyusha commented on HBASE-30137:
------------------------------------
[~wchevreuil] you are right, there is a lot of overlap between this and
HBASE-26624. However, HBASE-30137 specifically tries to regenerate the manifest
only if the latest manifest file is corrupted. In that scenario Region Open
is blocked because of the corrupted manifest file, so this issue primarily
targets recovering the region and getting it open.
With the introduction of virtual links in HBASE-27826, an uncompacted link
file can be part of the manifest, in which case we should do the data
recovery (as in the case of any hfile corruption). We should also identify
scenarios such as all the parent hfiles already being in the archive folder,
which implicitly indicates that compaction has finished and there is no data
loss.
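To make the archive check above concrete, here is a minimal sketch of the decision. It is not taken from any patch on this Jira: the class name LinkRecoverySketch, the Action enum and the classify method are all hypothetical, and file names are modelled as plain strings rather than HDFS paths.

```java
import java.util.Set;

public class LinkRecoverySketch {
  // Hypothetical outcome of inspecting a link file found in the manifest.
  enum Action { NO_DATA_LOSS, NEEDS_DATA_RECOVERY }

  // If every parent hfile is already archived, compaction is assumed to have
  // finished; otherwise the link may still be the only reference to the data.
  // Note: an empty parent set is treated as NO_DATA_LOSS in this sketch.
  static Action classify(Set<String> parentHFiles, Set<String> archivedHFiles) {
    return archivedHFiles.containsAll(parentHFiles)
        ? Action.NO_DATA_LOSS
        : Action.NEEDS_DATA_RECOVERY;
  }

  public static void main(String[] args) {
    Set<String> archived = Set.of("hfile-a", "hfile-b", "hfile-c");
    System.out.println(classify(Set.of("hfile-a", "hfile-b"), archived));
    System.out.println(classify(Set.of("hfile-a", "hfile-x"), archived));
  }
}
```

A real implementation would have to resolve the parent files from the link itself and list the archive directory on the filesystem; both steps are left out here.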
This Jira first tries to figure out the above and creates a path to unblock
Region Open. I tried to replicate this scenario of metadata file corruption
with the help of AI, and it did indeed block the region from opening.
*✅ HBASE-30137 repro CONFIRMED*
Here's the full failure stack, exactly the corruption-handling gap the JIRA
is about:
java.io.IOException: java.io.IOException: Invalid file length -1, either less than 0 or greater then max allowed size 16777216
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1187)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1130)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:1031)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:975)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7391)
    ...
    at org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:154)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
Caused by: java.io.IOException: Invalid file length -1, either less than 0 or greater then max allowed size 16777216
    at org.apache.hadoop.hbase.regionserver.storefiletracker.StoreFileListFile.load(StoreFileListFile.java:114)
    at org.apache.hadoop.hbase.regionserver.storefiletracker.StoreFileListFile.load(StoreFileListFile.java:144)
    at org.apache.hadoop.hbase.regionserver.storefiletracker.StoreFileListFile.load(StoreFileListFile.java:228)
    at org.apache.hadoop.hbase.regionserver.storefiletracker.FileBasedStoreFileTracker.doLoadStoreFiles(FileBasedStoreFileTracker.java:81)
    at org.apache.hadoop.hbase.regionserver.storefiletracker.StoreFileTrackerBase.load(StoreFileTrackerBase.java:89)
    at org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:362)
    at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:306)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6522)
*What's happening*
My corruption was 95 bytes of 0xFF. StoreFileListFile.load() reads the file
header: the first 4 bytes are parsed as an int. The original was 00 00 00 57
(length=87). The corrupted file's first 4 bytes are FF FF FF FF, which reads
as *-1* as a signed int. The code's bounds check at StoreFileListFile.java:114
(https://github.com/apache/hbase/blob/branch-2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/storefiletracker/StoreFileListFile.java#L114)
then fails with the "Invalid file length -1" IOException shown in the stack
trace above.
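The header parsing described above can be reproduced outside HBase with a small standalone sketch. This is not the actual StoreFileListFile code: the class name ManifestHeaderSketch and the readLength method are hypothetical, the length prefix is assumed to be a 4-byte big-endian int, and MAX_FILE_SIZE is assumed to match the 16777216 limit quoted in the exception message.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public class ManifestHeaderSketch {
  // Assumption: mirrors the 16777216 limit quoted in the exception message.
  static final int MAX_FILE_SIZE = 16 * 1024 * 1024;

  // Reads the 4-byte big-endian length prefix and applies a bounds check
  // similar in spirit to the one that rejects the corrupted manifest.
  static int readLength(byte[] fileBytes) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes))) {
      int length = in.readInt();
      if (length < 0 || length > MAX_FILE_SIZE) {
        throw new IOException("Invalid file length " + length);
      }
      return length;
    }
  }

  public static void main(String[] args) throws IOException {
    // Healthy header: 00 00 00 57 is 87 in decimal.
    System.out.println(readLength(new byte[] {0x00, 0x00, 0x00, 0x57}));

    // Corrupted file: 95 bytes of 0xFF, so the header is FF FF FF FF.
    byte[] corrupted = new byte[95];
    Arrays.fill(corrupted, (byte) 0xFF);
    try {
      readLength(corrupted);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The first call returns 87, and the second throws because FF FF FF FF is -1 when interpreted as a signed 32-bit int, which is the same value reported in the stack trace.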
> In FSFT, generate a new manifest incase the latest manifest file gets
> corrupted
> -------------------------------------------------------------------------------
>
> Key: HBASE-30137
> URL: https://issues.apache.org/jira/browse/HBASE-30137
> Project: HBase
> Issue Type: Improvement
> Reporter: Prathyusha
> Priority: Major
>
> Add an offline recovery mechanism for a corrupted FILE store file tracker
> manifest.
> The repair should support a minimal {{disk-only}} mode that rebuilds manifest
> contents from the child family directory, and a {{lineage-assisted}} mode
> that augments those files by simulating split/merge parent artifacts when
> current {{hbase:meta}} still proves lineage.
> The repair should write a fresh manifest generation instead of modifying the
> corrupted manifest in place, and support dry-run for safe operator use.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)