[ 
https://issues.apache.org/jira/browse/HBASE-18771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-18771:
-----------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0
                   2.0.0
           Status: Resolved  (was: Patch Available)

Pushed to branch-1.3 and up

> Incorrect StoreFileRefresh leading to split and compaction failures
> -------------------------------------------------------------------
>
>                 Key: HBASE-18771
>                 URL: https://issues.apache.org/jira/browse/HBASE-18771
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.3.1
>            Reporter: Abhishek Singh Chouhan
>            Assignee: Abhishek Singh Chouhan
>            Priority: Blocker
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.3.2, 1.5.0
>
>         Attachments: HBASE-18771.branch-1.3.001.patch, 
> HBASE-18771.branch-1.3.002.patch, HBASE-18771.branch-1.3.003.patch, 
> HBASE-18771.branch-1.3.004.patch, HBASE-18771.branch-1.3.005.patch, 
> HBASE-18771.master.001.patch, HBASE-18771.master.002.patch, 
> HBASE-18771.master.003.patch
>
>
> We ran into issues of compaction and split failures with 1.3 similar to 
> HBASE-18186 and HBASE-17406. Here's what i believe is happening -
> Lets say we have 4 store files that are compacted to form a new one. At this 
> point we now have 5 store files, however only 1(the newly formed) is open now 
> for the store and rest are waiting to get archived by HFileArchiver
> Now before the files are archived we get a FNFE in a scanner. This results in 
> HRegion.RegionScannerImpl.handleFileNotFound(FileNotFoundException fnfe) 
> being called which results in region.refreshStoreFiles(true) -> 
> HStore.refreshStoreFiles()
> HStore.refreshStoreFiles now checks the hdfs dir and adds the previously 
> compacted files back to the store, however these files are also present in 
> StoreFileManager's compactedFiles list. Now at this point HFileArchiver runs, 
> checks compactedFiles list and moves these files into the archive directory. 
> Now when compaction runs it gets:
> 2017-09-04 12:30:13,899 ERROR [ctions-1504505399609] 
> regionserver.CompactSplitThread - Compaction selection failed regionName = 
> xxxx, storeName = 0, priority = 26, time = 1504528213899
> java.io.FileNotFoundException: File does not exist: hdfs://xxxx
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1337)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1329)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1329)
>         at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:325)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
>         at 
> org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:65)
>         at 
> org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
>         at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
>         at 
> org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1679)
> Similarly if a split happens after archival we fail after PONR while opening 
> daughter regions due to FNFE. This results in parent offline and daughters 
> also in a limbo since they're unable to open. Since we get the error after 
> PONR we also end up aborting the RS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to