[
https://issues.apache.org/jira/browse/HBASE-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576944#comment-15576944
]
Gary Helmling commented on HBASE-16754:
---------------------------------------
The underlying cause here is a regionserver (A) that stalls shortly after it
has completed a compaction. The master sees rs A as down, farms out log
splitting for its WALs, and reassigns the region with the recently completed
compaction to regionserver B. Regionserver B opens the region and obtains a
list of the store files, including the files that were recently compacted
away. rs A then resumes from the stall and, before the regionserver aborts,
the CompactedHFilesDischarger runs and archives those compacted HFiles. rs B's
store file list now references files that have been moved out from under it on
HDFS, so when we try to get the FileStatus for one of the archived store
files, we receive a FileNotFoundException.
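To make the ordering of events concrete, here is a minimal toy sketch of the
race in plain Java. It is not HBase code: ToyFs, simulate() and the file names
f1, f2 and f3 are hypothetical stand-ins, and the "filesystem" is just an
in-memory set of names.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the race described above; none of these names are HBase classes. */
public class StaleStoreFileRace {

    /** Hypothetical stand-in for the store directory on HDFS: a set of live file names. */
    public static final class ToyFs {
        private final TreeSet<String> live = new TreeSet<>();

        void create(String name) { live.add(name); }

        /** Moving a file to the archive makes it disappear from the store directory. */
        void archive(String name) { live.remove(name); }

        /** Sorted listing, so the simulation is deterministic. */
        List<String> list() { return new ArrayList<>(live); }

        long getModificationTime(String name) throws FileNotFoundException {
            if (!live.contains(name)) {
                throw new FileNotFoundException("File does not exist: " + name);
            }
            return 0L;
        }
    }

    /** Returns the first file rs B fails on, or null if nothing fails. */
    public static String simulate() {
        ToyFs fs = new ToyFs();
        fs.create("f1");
        fs.create("f2");

        // rs A compacts f1 + f2 into f3; the inputs stay in place until archived.
        fs.create("f3");

        // rs A stalls, the region is reassigned, and rs B lists the store files.
        List<String> rsBStoreFiles = fs.list(); // f1, f2, f3

        // rs A resumes and archives the compacted inputs before it aborts.
        fs.archive("f1");
        fs.archive("f2");

        // rs B later checks modification times during compaction selection.
        for (String file : rsBStoreFiles) {
            try {
                fs.getModificationTime(file);
            } catch (FileNotFoundException e) {
                return file;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("rs B failed on: " + simulate()); // prints "rs B failed on: f1"
    }
}
```

The point of the sketch is only that rs B's list is a snapshot taken before the
archive step, so nothing in rs B's own state tells it the files are gone.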
We have a sort of fencing for this in the compaction marker written to the WAL
before compaction completes. However, after HBASE-15441, these markers are now
dropped by WALSplitter.LogRecoveredEditsOutputSink, along with the other
region-level markers it doesn't care about.
We have a test that the compaction marker removes compacted storefiles from the
store file manager in TestHRegion.testRecoveredEditsReplayCompaction(), but
that test explicitly writes the marker into the recovered edits file itself.
We have no existing coverage verifying that the compaction marker makes it
through log splitting.
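As a rough illustration of why the marker has to survive log splitting, here is
a toy sketch in plain Java. It does not use WALSplitter or any other HBase
class: Entry, split(), openRegion() and demo() are hypothetical names, and the
replay rule (drop the compaction inputs, keep the output) is a simplification
of what the marker replay is described as doing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Toy model only; these are not HBase classes or APIs. */
public class CompactionMarkerFencing {

    /** Hypothetical WAL entry: either a plain edit or a compaction marker. */
    public static final class Entry {
        final List<String> compactedInputs; // empty for a plain edit
        final String compactedOutput;       // null for a plain edit

        private Entry(List<String> inputs, String output) {
            this.compactedInputs = inputs;
            this.compactedOutput = output;
        }

        static Entry edit() {
            return new Entry(Collections.<String>emptyList(), null);
        }

        static Entry compaction(String output, String... inputs) {
            return new Entry(Arrays.asList(inputs), output);
        }

        boolean isCompactionMarker() { return compactedOutput != null; }
    }

    /** Toy log split: copies entries to "recovered edits", optionally dropping markers. */
    public static List<Entry> split(List<Entry> wal, boolean keepCompactionMarkers) {
        List<Entry> recovered = new ArrayList<>();
        for (Entry e : wal) {
            if (!e.isCompactionMarker() || keepCompactionMarkers) {
                recovered.add(e);
            }
        }
        return recovered;
    }

    /** Toy region open: start from the listed files, then replay the recovered edits. */
    public static List<String> openRegion(List<String> listed, List<Entry> recovered) {
        TreeSet<String> storeFiles = new TreeSet<>(listed);
        for (Entry e : recovered) {
            if (e.isCompactionMarker()) {
                storeFiles.removeAll(e.compactedInputs); // inputs are fenced off
                storeFiles.add(e.compactedOutput);
            }
        }
        return new ArrayList<>(storeFiles);
    }

    /** Runs split plus open on a fixed WAL in which f1 + f2 were compacted into f3. */
    public static List<String> demo(boolean keepCompactionMarkers) {
        List<Entry> wal = Arrays.asList(
            Entry.edit(), Entry.compaction("f3", "f1", "f2"), Entry.edit());
        List<String> listed = Arrays.asList("f1", "f2", "f3");
        return openRegion(listed, split(wal, keepCompactionMarkers));
    }

    public static void main(String[] args) {
        System.out.println("markers dropped: " + demo(false)); // [f1, f2, f3]
        System.out.println("markers kept:    " + demo(true));  // [f3]
    }
}
```

A real test for the missing coverage would presumably need to drive the actual
log splitting path rather than write the marker by hand; the sketch only shows
the shape of the assertion such a test would make.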
> Regions failing compaction due to referencing non-existent store file
> ---------------------------------------------------------------------
>
> Key: HBASE-16754
> URL: https://issues.apache.org/jira/browse/HBASE-16754
> Project: HBase
> Issue Type: Bug
> Reporter: Gary Helmling
> Assignee: Gary Helmling
> Priority: Blocker
> Fix For: 1.3.0
>
>
> Running a mixed read write workload on a recent build off branch-1.3, we are
> seeing compactions occasionally fail with errors like the following (actual
> filenames replaced with placeholders):
> {noformat}
> 16/09/27 16:57:28 ERROR regionserver.CompactSplitThread: Compaction selection failed Store = XXX, pri = 116
> java.io.FileNotFoundException: File does not exist: hdfs://.../hbase/data/ns/table/region/cf/XXfilenameXX
>     at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
>     at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:321)
>     at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
>     at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:63)
>     at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
>     at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
>     at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1644)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.selectCompaction(CompactSplitThread.java:373)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.access$100(CompactSplitThread.java:59)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:498)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:568)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/09/27 17:01:31 ERROR regionserver.CompactSplitThread: Compaction selection failed Store = XXX, pri = 115
> java.io.FileNotFoundException: File does not exist: hdfs://.../hbase/data/ns/table/region/cf/XXfilenameXX
>     at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
>     at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
>     at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:321)
>     at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
>     at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:63)
>     at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
>     at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
>     at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1644)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.selectCompaction(CompactSplitThread.java:373)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.access$100(CompactSplitThread.java:59)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:498)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:568)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> It looks like we somehow deleted the underlying store file from HDFS
> (probably after it was compacted away), after the path was loaded into the
> list of store files for the region.
> In both of the cases that I looked into, the region in question was
> previously hosted by a regionserver that stalled, then aborted after its zk
> session expired. In both cases it also looked like a compaction was in
> progress. So it's possible that the compacted files are being deleted from
> HDFS by the stalled regionserver before it aborts, but after the region has
> been opened by a new regionserver. That's speculation though and needs to be
> substantiated.