[ https://issues.apache.org/jira/browse/HBASE-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534536#comment-17534536 ]
Hudson commented on HBASE-26999: -------------------------------- Results for branch master [build #583 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/583/]: (x) *{color:red}-1 overall{color}* ---- details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/583/General_20Nightly_20Build_20Report/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/583/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/] (/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/583/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > HStore should try write WAL compaction marker before replacing compacted > files in StoreEngine > --------------------------------------------------------------------------------------------- > > Key: HBASE-26999 > URL: https://issues.apache.org/jira/browse/HBASE-26999 > Project: HBase > Issue Type: Sub-task > Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.6.0 > Reporter: Wellington Chevreuil > Assignee: Wellington Chevreuil > Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3 > > > On HBASE-26064, it seems we altered the order we update different places with > the results of a compaction: > {noformat} > @@ -1510,14 +1149,13 @@ public class HStore implements Store, HeapSize, > StoreConfigInformation, > List<Path> newFiles) throws IOException { > // Do the steps necessary to complete the compaction. > setStoragePolicyFromFileName(newFiles); > - List<HStoreFile> sfs = commitStoreFiles(newFiles, true); > + List<HStoreFile> sfs = storeEngine.commitStoreFiles(newFiles, true); > if (this.getCoprocessorHost() != null) { > for (HStoreFile sf : sfs) { > getCoprocessorHost().postCompact(this, sf, cr.getTracker(), cr, > user); > } > } > - writeCompactionWalRecord(filesToCompact, sfs); > - replaceStoreFiles(filesToCompact, sfs); > + replaceStoreFiles(filesToCompact, sfs, true); > ... > @@ -1581,25 +1219,24 @@ public class HStore implements Store, HeapSize, > StoreConfigInformation, > this.region.getRegionInfo(), compactionDescriptor, > this.region.getMVCC()); > } > > - void replaceStoreFiles(Collection<HStoreFile> compactedFiles, > Collection<HStoreFile> result) > - throws IOException { > - this.lock.writeLock().lock(); > - try { > - > this.storeEngine.getStoreFileManager().addCompactionResults(compactedFiles, > result); > - synchronized (filesCompacting) { > - filesCompacting.removeAll(compactedFiles); > - } > - > - // These may be null when the RS is shutting down. The space quota > Chores will fix the Region > - // sizes later so it's not super-critical if we miss these. > - RegionServerServices rsServices = region.getRegionServerServices(); > - if (rsServices != null && > rsServices.getRegionServerSpaceQuotaManager() != null) { > - updateSpaceQuotaAfterFileReplacement( > - > rsServices.getRegionServerSpaceQuotaManager().getRegionSizeStore(), > getRegionInfo(), > - compactedFiles, result); > - } > - } finally { > - this.lock.writeLock().unlock(); > + @RestrictedApi(explanation = "Should only be called in TestHStore", link = > "", > + allowedOnPath = ".*/(HStore|TestHStore).java") > + void replaceStoreFiles(Collection<HStoreFile> compactedFiles, > Collection<HStoreFile> result, > + boolean writeCompactionMarker) throws IOException { > + storeEngine.replaceStoreFiles(compactedFiles, result); > + if (writeCompactionMarker) { > + writeCompactionWalRecord(compactedFiles, result); > + } > + synchronized (filesCompacting) { > + filesCompacting.removeAll(compactedFiles); > + } > + // These may be null when the RS is shutting down. The space quota > Chores will fix the Region > + // sizes later so it's not super-critical if we miss these. > + RegionServerServices rsServices = region.getRegionServerServices(); > + if (rsServices != null && rsServices.getRegionServerSpaceQuotaManager() > != null) { > + updateSpaceQuotaAfterFileReplacement( > + rsServices.getRegionServerSpaceQuotaManager().getRegionSizeStore(), > getRegionInfo(), > + compactedFiles, result); {noformat} > While running some large scale load test, we run into File SFT metafiles > inconsistency that we believe could have been avoided if the original order > was in place. Here the scenario we had: > 1) Region R with one CF f was open on RS1. At this time, the given store had > some files, let's say these were file1, file2 and file3; > 2) Compaction started on RS1; > 3) RS1 entered a long GC pause, lost ZK lock. Compaction is still running, > though. > 4) RS2 opens R. The related File SFT instance for this store then creates a > new meta file with file1, file2 and file3. > 5) Compaction on RS1 successfully completes the > *storeEngine.replaceStoreFiles* call. This updates the in memory cache of > valid files (StoreFileManager) and the SFT metafile for the store engine on > RS1 with the compaction resulting file, say file4, removing file1, file2 and > file3. Note that the SFT meta file used by RS1 here is different (older) than > the one used by RS2. > 6) Compaction on RS1 tries to update WAL marker, but fails to do so, as the > WAL already got closed when the RS1 ZK lock expired. This triggers a store > close in RS1. As part of the store close process, it removes all files it > sees as completed compacted, in this case, file1, file2 and file3. > 7) RS2 still references file1, file2 and file3. It then gets > FileNotFoundException when trying to open any of these files. > This situation would had been avoided if the original order of a) write WAL > marker, then b) replace store files was kept. -- This message was sent by Atlassian Jira (v8.20.7#820007)