[ 
https://issues.apache.org/jira/browse/HBASE-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-30154:
-----------------------------------
    Labels: pull-request-available  (was: )

> HFiles Deleted While Still Referenced by HFileLinks in Cloned Table
> -------------------------------------------------------------------
>
>                 Key: HBASE-30154
>                 URL: https://issues.apache.org/jira/browse/HBASE-30154
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 2.6.2
>            Reporter: Shrinidhi Shridhar Talpankar
>            Priority: Major
>              Labels: pull-request-available
>
> There is a race condition in which HFiles referenced by a cloned table can be 
> permanently deleted by HFileLinkCleaner, causing data loss. We have seen this 
> happen on HBase 2.6.2, and from reading the latest code it appears that the 
> race can still happen there.
> The sequence of operations that leads to this issue:
>  # Regions merged in the parent table.
>  # Snapshot taken while a merged region still holds reference files, and that 
> snapshot is then cloned. RestoreSnapshotHelper.restoreReferenceFile() creates 
> HFileLink Reference files in the clone (e.g. 
> srcTable=srcRegion-hfile.cloneRegion) but does not write back-references to 
> the archive directory.
>  # Parent table compacted. CatalogJanitor GCs the pre-merge regions, 
> archiving their HFiles.
>  # HFileLinkCleaner sees no back-references for those archived HFiles and 
> deletes them.
>  # Every subsequent region open/scan on the cloned table fails with 
> FileNotFoundException.
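
The five steps above can be replayed with a toy in-memory model. This is only a sketch of the reported sequence, not HBase code: the `Cluster` class, its fields, and the shortened file key are hypothetical names made up for illustration, and the cleaner rule ("delete an archived file when no back-reference is registered") is taken from step 4 of the report.

```python
# Toy in-memory model of the reported sequence; this is NOT HBase code.
# All names here (Cluster, back_refs, clone_links, ...) are hypothetical.

class Cluster:
    def __init__(self):
        self.data = set()       # live HFiles, keyed as "region/hfile"
        self.archive = set()    # archived HFiles, same key format
        self.back_refs = {}     # file key -> set of tables that registered a link
        self.clone_links = {}   # clone table -> set of file keys it links to

    def clone(self, table, keys, write_back_refs):
        # Step 2: the clone records links; back-references are optional here
        # so that both behaviours can be compared.
        self.clone_links[table] = set(keys)
        if write_back_refs:
            for key in keys:
                self.back_refs.setdefault(key, set()).add(table)

    def archive_region(self, region):
        # Step 3: move every file of the region from data to archive.
        moved = {k for k in self.data if k.startswith(region + "/")}
        self.data -= moved
        self.archive |= moved

    def run_link_cleaner(self):
        # Step 4: delete archived files that have no registered back-reference.
        deletable = {k for k in self.archive if not self.back_refs.get(k)}
        self.archive -= deletable
        return deletable

    def broken_links(self, table):
        # Step 5: links whose target exists neither in data nor in archive.
        return {k for k in self.clone_links[table]
                if k not in self.data and k not in self.archive}


def simulate(write_back_refs):
    c = Cluster()
    c.data.add("c8bcf55f/d2d48db7")                            # pre-merge HFile
    c.clone("clone", ["c8bcf55f/d2d48db7"], write_back_refs)
    c.archive_region("c8bcf55f")
    c.run_link_cleaner()
    return c.broken_links("clone")


print(simulate(write_back_refs=False))  # {'c8bcf55f/d2d48db7'}
print(simulate(write_back_refs=True))   # set()
```

In this model the only difference between the two runs is whether the clone registers a back-reference, which matches the gap the report attributes to RestoreSnapshotHelper.restoreReferenceFile().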
>  
> Log evidence supporting the above flow:
> h4. 07:41:16Z — Merge 1 Initiated (RegionNormalizer)
> The {{RegionNormalizerWorker}} identified regions {{e95408fd}} (4511 MB) and 
> {{565ae77f}} (128 MB) as candidates for merging due to size imbalance. It 
> submitted {{MergeTableRegionsProcedure}} (pid=2307105).
> INFO [normalizer-worker-0] master.HMaster: merge regions [e95408fd3383e18df6f3a171a6bfc05d, 565ae77f2c2e16d6ee539dd95540a240]
> INFO [PEWorker-9] procedure.MasterProcedureScheduler: Took xlock for pid=2307105, state=RUNNABLE:MERGE_TABLE_REGIONS_PREPARE
> Source regions unassigned from regionserver-11 and regionserver-13.
> h4. 07:41:18Z — Merge 1 Completes → {{43bb39bf}} created
> INFO [PEWorker-14] procedure2.ProcedureExecutor: Finished pid=2307105, 
> state=SUCCESS; MergeTableRegionsProcedure table=aeris_v2, 
> regions=[e95408fd3383e18df6f3a171a6bfc05d, 565ae77f2c2e16d6ee539dd95540a240], 
> force=false in 2.0710 sec
> Merged region {{43bb39bf414f2c2bb4c453a7c4ace246}} is assigned to 
> regionserver-11. Its store directory contains HFileLinks pointing back to the 
> source region directories.
> h4. 09:41:16Z — Merge 2 Initiated (RegionNormalizer)
> The normalizer identified regions {{449b39800}} (2964 MB) and {{c8bcf55f}} 
> (1694 MB) and submitted {{MergeTableRegionsProcedure}} (pid=2307118).
> INFO [normalizer-worker-0] master.HMaster: merge regions 
> [449b39800c6efb9ce3eca36410d292c1, c8bcf55fde67e62655c51caec1a2ca96]
> h4. 09:41:19Z — Merge 2 Completes → {{d5243cea}} created
> INFO [PEWorker-9] procedure2.ProcedureExecutor: Finished pid=2307118, 
> state=SUCCESS; MergeTableRegionsProcedure table=aeris_v2, 
> regions=[449b39800c6efb9ce3eca36410d292c1, c8bcf55fde67e62655c51caec1a2ca96], 
> force=false in 2.7930 sec
> Merged region {{d5243cea04f0a22e4292c1d006580385}} is assigned to 
> regionserver-88. Contains HFileLinks to {{449b39800}} and {{c8bcf55f}} source 
> HFiles.
> h4. 14:02:59Z — Major Compaction Starts on {{d5243cea}} (regionserver-88)
> The merged region immediately triggers a high-priority major compaction 
> (labeled as "recently split daughter region" — this is the HBase labeling for 
> merged regions). The compaction includes 9 files, among which are HFileLinks 
> to the source regions:
> INFO [longCompactions-0] regionserver.HStore: Starting compaction of 
> d5243cea.../e... 
> hdfs://.../d5243cea.../e/d2d48db764e645d1ad5eb2bb1d1409c0.c8bcf55fde67e62655c51caec1a2ca96->...
>  
> hdfs://.../d5243cea.../e/3c9628b6aeaa4ae5944667d35ddf0fc3.c8bcf55fde67e62655c51caec1a2ca96->...
>  totalSize=4.6 G
> *This compaction will take 44 minutes and 53 seconds.*
> h4. ~14:24Z — ⚠️ Snapshot Taken WHILE Compaction is Running
> A snapshot {{aeris_v2_triforce_snapshot_1773843841734812535}} is taken of 
> {{aeris_v2}}. The snapshot name encodes a Unix timestamp of approximately 
> {{1773843841}} seconds, corresponding to ~14:24 UTC March 18, 2026.
> At this moment:
>  * {{d5243cea}} still contains HFileLinks to {{449b39800}}/{{c8bcf55f}} 
> (compaction has NOT finished)
>  * {{43bb39bf}} still contains HFileLinks to {{e95408fd}}/{{565ae77f}}
> The snapshot manifest captures this state — the merged regions' HFileLinks to 
> the source regions are part of the snapshot.
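
The timestamp reading of the snapshot name can be checked with a few lines of Python. This is a sketch under one assumption: that the first 10 digits of the numeric suffix are whole Unix seconds, as the report states, and that the remaining digits are a sub-second fraction. The helper name `snapshot_time` is made up for this example.

```python
from datetime import datetime, timezone

def snapshot_time(name):
    """Decode the trailing numeric suffix of a snapshot name as a UTC time.

    Assumption: the first 10 digits of the suffix are Unix seconds; the
    remaining digits are ignored.
    """
    suffix = name.rsplit("_", 1)[1]
    seconds = int(suffix[:10])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(snapshot_time("aeris_v2_triforce_snapshot_1773843841734812535").isoformat())
# 2026-03-18T14:24:01+00:00
```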
> h4. 14:24:15Z - 14:28:05Z — Clone Created: 
> {{aeris_v2_triforce_archive_1773843841734812535}}
> {{CloneSnapshotProcedure}} runs, invoking {{RestoreSnapshotHelper}} which 
> creates HFileLinks in the new archive table pointing directly to the source 
> region directories (the ones being cleaned up after merge). Key HFileLinks 
> added:
> INFO [RestoreSnapshot-pool-5] snapshot.RestoreSnapshotHelper: Adding HFileLink 37211e9b8f2744f3a7e2707594b39fbb.565ae77f2c2e16d6ee539dd95540a240 from cloned region in snapshot aeris_v2_triforce_snapshot_1773843841734812535 to table=aeris_v2_triforce_archive_1773843841734812535
> INFO [RestoreSnapshot-pool-7] snapshot.RestoreSnapshotHelper: Adding HFileLink 09aa8849c2d64984942a80b026e4a19e.449b39800c6efb9ce3eca36410d292c1 from cloned region in snapshot aeris_v2_triforce_snapshot_1773843841734812535 to table=aeris_v2_triforce_archive_1773843841734812535
> Clone procedure completes at 14:28:05Z:
> INFO [PEWorker-16] procedure.CloneSnapshotProcedure: Clone 
> snapshot=aeris_v2_triforce_snapshot_1773843841734812535 on 
> table=aeris_v2_triforce_archive_1773843841734812535 completed!
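
For anyone checking their own master logs for the same exposure, the "Adding HFileLink" lines can be cross-referenced against the parents of later GCMultipleMergedRegionsProcedure runs. The sketch below assumes only the `<hfile>.<region>` naming visible in the log lines above; the function names `linked_files` and `at_risk` are hypothetical, and the regular expression is fitted to these sample lines rather than to any documented log format.

```python
import re

# Matches the "Adding HFileLink <hfile>.<region>" fragment seen in the log above.
LINK = re.compile(r"Adding HFileLink ([0-9a-f]{32})\.([0-9a-f]{32})\b")

def linked_files(log_text):
    """Return (region, hfile) pairs for every HFileLink the clone added."""
    flat = " ".join(log_text.split())  # undo line wrapping before matching
    return [(m.group(2), m.group(1)) for m in LINK.finditer(flat)]

def at_risk(links, gc_parents):
    """Keep only links whose source region was later garbage-collected."""
    return [(region, hfile) for region, hfile in links if region in gc_parents]

log = """
INFO [RestoreSnapshot-pool-5] snapshot.RestoreSnapshotHelper: Adding
HFileLink 37211e9b8f2744f3a7e2707594b39fbb.565ae77f2c2e16d6ee539dd95540a240
from cloned region in snapshot aeris_v2_triforce_snapshot_1773843841734812535
INFO [RestoreSnapshot-pool-7] snapshot.RestoreSnapshotHelper: Adding HFileLink
09aa8849c2d64984942a80b026e4a19e.449b39800c6efb9ce3eca36410d292c1 from cloned
region in snapshot aeris_v2_triforce_snapshot_1773843841734812535
"""

# Parents named in the Group 2 GC procedure (pid=2326702) in this report.
group2_parents = {
    "449b39800c6efb9ce3eca36410d292c1",
    "c8bcf55fde67e62655c51caec1a2ca96",
}

for region, hfile in at_risk(linked_files(log), group2_parents):
    print(region, hfile)
```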
> h4. 14:47:52Z — ⚠️ Compaction of {{d5243cea}} Completes
> After 44 minutes and 53 seconds, the major compaction finishes:
> INFO [longCompactions-0] regionserver.CompactSplit: Completed compaction region=aeris_v2,...d5243cea04f0a22e4292c1d006580385., storeName=d5243cea04f0a22e4292c1d006580385/e, duration=44mins, 53sec
> INFO [longCompactions-0] regionserver.HStore: Completed compaction of 9 (all) file(s) in d5243cea04f0a22e4292c1d006580385/e into 9e904c6e8b2247cf9c0463aecc1d67b1 (size=4.4 G)
> The HFileLinks to {{449b39800}} and {{c8bcf55f}} are *removed* from 
> {{d5243cea}}'s store directory. These source region files are no longer 
> referenced by any live region in {{aeris_v2}}. This is the gate condition 
> for {{GCMultipleMergedRegionsProcedure}} to proceed.
> h4. 14:51:50Z — 💥 GCMultipleMergedRegionsProcedure Deletes Source Regions 
> (Group 2)
> {{GCMultipleMergedRegionsProcedure}} (pid=2326702) runs immediately (only 
> 93ms total execution):
> INFO [PEWorker-11] procedure.MasterProcedureScheduler: Took xlock for pid=2326702, state=RUNNABLE:GC_MERGED_REGIONS_PREPARE; GCMultipleMergedRegionsProcedure child=d5243cea04f0a22e4292c1d006580385, parents:[449b39800c6efb9ce3eca36410d292c1], [c8bcf55fde67e62655c51caec1a2ca96]
> INFO [PEWorker-4] hbase.MetaTableAccessor: Deleted aeris_v2,...449b39800c6efb9ce3eca36410d292c1.
> INFO [PEWorker-11] hbase.MetaTableAccessor: Deleted aeris_v2,...c8bcf55fde67e62655c51caec1a2ca96.
> INFO [PEWorker-10] procedure2.ProcedureExecutor: Finished pid=2326702, state=SUCCESS; GCMultipleMergedRegionsProcedure child=d5243cea04f0a22e4292c1d006580385, parents:[449b39800c6efb9ce3eca36410d292c1], [c8bcf55fde67e62655c51caec1a2ca96] in 93 msec
> The HDFS directories for {{449b39800c6efb9ce3eca36410d292c1}} and 
> {{c8bcf55fde67e62655c51caec1a2ca96}} — including both {{/hbase/data/}} and 
> {{/hbase/archive/}} paths — are deleted. The HFileLinks in 
> {{aeris_v2_triforce_archive}} now point to non-existent paths.
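
The Group 2 timeline is easy to sanity-check with the timestamps quoted in this report. The sketch below uses only those values; the snapshot time is the approximate ~14:24 given above (rounded to 14:24:00 here, which is an assumption), and the helper `t` is just a local convenience.

```python
from datetime import datetime, timedelta

def t(hms):
    # All events in this report fall on March 18, 2026 (UTC).
    return datetime.strptime("2026-03-18 " + hms, "%Y-%m-%d %H:%M:%S")

compaction_start = t("14:02:59")
compaction_end = compaction_start + timedelta(minutes=44, seconds=53)
snapshot_taken = t("14:24:00")   # approximate, per the snapshot name
clone_done = t("14:28:05")
gc_ran = t("14:51:50")

# The reported duration lands exactly on the logged completion time.
print(compaction_end.time())                                 # 14:47:52
# The snapshot was taken while the compaction was still running.
print(compaction_start < snapshot_taken < compaction_end)    # True
# The clone had existed for this long when its link targets were deleted.
print(gc_ran - clone_done)                                   # 0:23:45
```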
> h4. 15:27:37Z — Compaction Triggered on {{43bb39bf}}
> A flush on {{43bb39bf}} completes, requesting compaction:
> INFO [MemStoreFlusher.1] regionserver.HRegion: Finished flush of 
> 43bb39bf414f2c2bb4c453a7c4ace246 in 73ms, compaction requested=true
> h4. 17:15:36Z — First FileNotFoundException (Group 2: {{449b39800}})
> {{aeris_v2_triforce_archive}} attempts to read through HFileLinks, fails:
> java.io.FileNotFoundException: HFileLink locations=[ 
> hdfs://aeris/hbase/data/default/aeris_v2/449b39800c6efb9ce3eca36410d292c1/e/09aa8849c2d64984942a80b026e4a19e,
>  
> hdfs://aeris/hbase/archive/data/default/aeris_v2/449b39800c6efb9ce3eca36410d292c1/e/09aa8849c2d64984942a80b026e4a19e,
>  
> hdfs://aeris/hbase/.tmp/data/default/aeris_v2/449b39800c6efb9ce3eca36410d292c1/e/09aa8849c2d64984942a80b026e4a19e,
>  
> hdfs://aeris/hbase/mobdir/data/default/aeris_v2/449b39800c6efb9ce3eca36410d292c1/e/09aa8849c2d64984942a80b026e4a19e]
> All 4 candidate locations checked by {{HFileLink.open()}} are missing.
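
The four locations in the exception follow a regular pattern, so the paths to inspect on HDFS for any other link can be generated the same way. This sketch only mirrors the paths printed in the exception message above; it is not HBase's link-resolution code, and `candidate_locations` is a hypothetical helper.

```python
def candidate_locations(root, namespace, table, region, family, hfile):
    """Build the four paths listed in the FileNotFoundException message.

    The prefixes (data, archive, .tmp, mobdir) are copied from the exception
    text in this report, in the order they appear there.
    """
    rel = f"data/{namespace}/{table}/{region}/{family}/{hfile}"
    return [f"{root}/{prefix}{rel}" for prefix in ("", "archive/", ".tmp/", "mobdir/")]

for path in candidate_locations(
    "hdfs://aeris/hbase", "default", "aeris_v2",
    "449b39800c6efb9ce3eca36410d292c1", "e",
    "09aa8849c2d64984942a80b026e4a19e",
):
    print(path)
```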
> h4. 18:06:05Z — Compaction of {{43bb39bf}} Completes (regionserver-11)
> INFO [longCompactions-0] regionserver.CompactSplit: Completed compaction 
> region=aeris_v2,...43bb39bf414f2c2bb4c453a7c4ace246., storeName=e, 
> fileCount=6, fileSize=4.5 G, duration=1mins, 35sec
> HFileLinks to {{e95408fd}} and {{565ae77f}} are removed from 
> {{43bb39bf}}. GC gate condition met for Group 1.
> h4. 18:11:52Z — 💥 GCMultipleMergedRegionsProcedure Deletes Source Regions 
> (Group 1)
> INFO [PEWorker-13] hbase.MetaTableAccessor: Deleted aeris_v2,...565ae77f2c2e16d6ee539dd95540a240.
> INFO [PEWorker-7] hbase.MetaTableAccessor: Deleted aeris_v2,...e95408fd3383e18df6f3a171a6bfc05d.
> INFO [PEWorker-12] procedure2.ProcedureExecutor: Finished pid=2357877, state=SUCCESS; GCMultipleMergedRegionsProcedure child=43bb39bf414f2c2bb4c453a7c4ace246, parents:[e95408fd3383e18df6f3a171a6bfc05d], [565ae77f2c2e16d6ee539dd95540a240] in 73 msec
> HDFS directories for {{e95408fd3383e18df6f3a171a6bfc05d}} and 
> {{565ae77f2c2e16d6ee539dd95540a240}} are deleted.
> h4. 21:16:49Z — First FileNotFoundException (Group 1: {{565ae77f}})
> java.io.FileNotFoundException: HFileLink locations=[ 
> hdfs://aeris/hbase/data/default/aeris_v2/565ae77f2c2e16d6ee539dd95540a240/e/37211e9b8f2744f3a7e2707594b39fbb,
>  ...]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
