[ https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Gray updated HBASE-2375:
---------------------------------

    Fix Version/s:     (was: 0.90.0)
                       0.92.0

Punting to 0.92 for now. The bigger compaction/flush improvements should happen in that version.

> Make decision to split based on aggregate size of all StoreFiles and revisit related config params
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2375
>                 URL: https://issues.apache.org/jira/browse/HBASE-2375
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.20.3
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-2375-v8.patch
>
>
> Currently we make the decision to split a region when a single StoreFile in a single family exceeds the maximum region size. This issue is about changing the split decision to be based on the aggregate size of all StoreFiles in a single family (but still not aggregating across families). This would move the split check to after flushes rather than after compactions.
> This issue should also deal with revisiting our default values for some related configuration parameters.
> The motivating factor for this change comes from watching the behavior of RegionServers during heavy write scenarios.
> Today the default behavior goes like this:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a compaction on this region.
> - Compaction queues notwithstanding, this will create a 192MB file, not triggering a split based on max region size (hbase.hregion.max.filesize).
> - You'll then flush two more 64MB MemStores, hit the compactionThreshold, and trigger a compaction.
> - You end up with 192 + 64 + 64 in a single compaction. This will create a single 320MB file and will trigger a split.
> - While you are performing the compaction (which now writes out 64MB more than the split size, so is about 5X slower than the time it takes to do a single flush), you are still taking on additional writes into MemStore.
> - Compaction finishes, the decision to split is made, and the region is closed. The region now has to flush whichever edits made it into the MemStore while the compaction ran. This flushing, in our tests, is by far the dominating factor in how long data is unavailable during a split. We measured about 1 second to do the region closing, master assignment, and reopening. Flushing could take 5-6 seconds, during which time the region is unavailable.
> - The daughter regions re-open on the same RS. As soon as their StoreFiles are opened, a compaction is triggered across all of them because they contain references. Since we cannot currently split a split (a daughter region that still holds references), we need to not hang on to these references for long.
> This behavior is really bad because of how often we have to rewrite data onto HDFS. Imports are usually just IO bound as the RS waits to flush and compact. In the above example, the first cell inserted into this region ends up being written to HDFS 4 times (initial flush, first compaction w/ no split decision, second compaction w/ split decision, third compaction on the daughter region). In addition, we leave a large window where we take on edits (during the second compaction of 320MB) and then must make the region unavailable as we flush it.
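To make the contrast concrete, here is a minimal Java sketch of the two checks. The class and method names, and the nested-list representation of per-family StoreFile sizes, are hypothetical and only illustrate the arithmetic; this is not the actual HRegion/Store code.

{code:java}
import java.util.List;

// Illustrative sketch only; hypothetical names, not the real HRegion/Store API.
public final class SplitDecisionSketch {

  // Current behavior: split as soon as any single StoreFile in any family
  // exceeds hbase.hregion.max.filesize.
  static boolean splitOnSingleFile(List<List<Long>> storeFileSizesByFamily, long maxFileSize) {
    for (List<Long> family : storeFileSizesByFamily) {
      for (long size : family) {
        if (size > maxFileSize) {
          return true;
        }
      }
    }
    return false;
  }

  // Proposed behavior: split when the aggregate size of all StoreFiles within
  // a single family exceeds the limit (still not aggregating across families).
  static boolean splitOnAggregateSize(List<List<Long>> storeFileSizesByFamily, long maxFileSize) {
    for (List<Long> family : storeFileSizesByFamily) {
      long aggregate = 0;
      for (long size : family) {
        aggregate += size;
      }
      if (aggregate > maxFileSize) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // One family holding the 192 + 64 + 64 (MB) layout from the walkthrough above.
    List<List<Long>> families = List.of(List.of(192L, 64L, 64L));
    System.out.println(splitOnSingleFile(families, 256L));     // false: no single file exceeds 256MB
    System.out.println(splitOnAggregateSize(families, 256L));  // true: 320MB aggregate exceeds 256MB
  }
}
{code}

With the 192 + 64 + 64 layout from the walkthrough, no single file exceeds a 256MB limit, but the 320MB aggregate does, so the aggregate-based check would request the split without waiting for the 320MB compaction to produce an oversized file.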
> If we increased the compactionThreshold to 5 and determined splits based on aggregate size, the behavior becomes:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After each MemStore flush, we calculate the aggregate size of all StoreFiles, and we can also check the compactionThreshold. For the first three flushes, neither limit is hit. On the fourth flush, we would see a total aggregate size of 256MB and decide to split.
> - The decision to split is made and the region is closed. This time, the region only has to flush out whichever edits made it into the MemStore during the snapshot/flush of the previous MemStore. This window shrinks by more than 75%, since it is now the time to write 64MB from memory rather than 320MB from aggregating 5 HDFS files. This will greatly reduce the time data is unavailable during splits.
> - The daughter regions re-open on the same RS. As soon as their StoreFiles are opened, a compaction is triggered across all of them because they contain references. This would stay the same.
> In this example, we only write a given cell twice (instead of 4 times): on the original flush, and post-split to remove the references. This also drastically reduces data unavailability during splits. The other benefit of the post-split compaction (which doesn't change) is that we then get good data locality, as the resulting StoreFile will be written to the local DataNode. In another jira, we should deal with opening up one of the daughter regions on a different RS to distribute load better, but that's outside the scope of this one.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
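A similarly rough sketch of the post-flush timeline from the second walkthrough follows. The 256MB max region size is an assumed value (the walkthrough only states the 64MB flush size, the raised compactionThreshold of 5, and the 256MB aggregate at split time); the loop is a simulation of the arithmetic, not the real flush/compaction code path.

{code:java}
// Back-of-the-envelope simulation of the proposed post-flush split check;
// illustrative only, not RegionServer code.
public final class PostFlushSplitSketch {
  static final long MB = 1024L * 1024L;
  static final long FLUSH_SIZE = 64 * MB;        // hbase.hregion.memstore.flush.size
  static final long MAX_REGION_SIZE = 256 * MB;  // hbase.hregion.max.filesize (assumed value)
  static final int COMPACTION_THRESHOLD = 5;     // hbase.hstore.compactionThreshold (proposed)

  public static void main(String[] args) {
    long aggregate = 0;
    int storeFiles = 0;
    for (int flush = 1; flush <= 6; flush++) {
      aggregate += FLUSH_SIZE;  // each MemStore flush writes one new 64MB StoreFile
      storeFiles++;
      if (aggregate >= MAX_REGION_SIZE) {
        // Split decided right after a flush: only the edits taken on during
        // that one flush still need flushing before the region closes.
        System.out.println("flush " + flush + ": aggregate " + (aggregate / MB)
            + "MB >= max region size, split");
        return;
      }
      if (storeFiles >= COMPACTION_THRESHOLD) {
        System.out.println("flush " + flush + ": compaction of " + storeFiles + " StoreFiles");
        storeFiles = 1;  // compaction would rewrite the files into one
      }
    }
  }
}
{code}

Run as-is, it reports the split on the fourth flush, matching the walkthrough: the aggregate reaches 256MB before the compactionThreshold of 5 is ever hit, so the only data left to flush at split time is whatever arrived during that last flush.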