[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Gray updated HBASE-2375:
---------------------------------

    Fix Version/s:     (was: 0.90.0)
                   0.92.0

Punting to 0.92 for now.  The bigger compaction/flush improvements should 
happen in that version.

> Make decision to split based on aggregate size of all StoreFiles and revisit 
> related config params
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2375
>                 URL: https://issues.apache.org/jira/browse/HBASE-2375
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.20.3
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-2375-v8.patch
>
>
> Currently we will make the decision to split a region when a single StoreFile 
> in a single family exceeds the maximum region size.  This issue is about 
> changing the decision to split to be based on the aggregate size of all 
> StoreFiles in a single family (but still not aggregating across families).  
> This would move a check to split after flushes rather than after compactions. 
>  This issue should also deal with revisiting our default values for some 
> related configuration parameters.
> The motivating factor for this change comes from watching the behavior of 
> RegionServers during heavy write scenarios.
> Today the default behavior goes like this:
> - We fill up regions, and as long as you are not under global RS heap 
> pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) 
> StoreFiles.
> - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a 
> compaction on this region.
> - Compaction queues notwithstanding, this will create a 192MB file, not 
> triggering a split based on max region size (hbase.hregion.max.filesize).
> - You'll then flush two more 64MB MemStores and hit the compactionThreshold 
> and trigger a compaction.
> - You end up with 192 + 64 + 64 in a single compaction.  This will create a 
> single 320MB and will trigger a split.
> - While you are performing the compaction (which now writes out 64MB more 
> than the split size, so is about 5X slower than the time it takes to do a 
> single flush), you are still taking on additional writes into MemStore.
> - Compaction finishes, decision to split is made, region is closed.  The 
> region now has to flush whichever edits made it to MemStore while the 
> compaction ran.  This flushing, in our tests, is by far the dominating factor 
> in how long data is unavailable during a split.  We measured about 1 second 
> to do the region closing, master assignment, reopening.  Flushing could take 
> 5-6 seconds, during which time the region is unavailable.
> - The daughter regions re-open on the same RS.  Immediately when the 
> StoreFiles are opened, a compaction is triggered across all of their 
> StoreFiles because they contain references.  Since we cannot currently split 
> a split, we need to not hang on to these references for long.
> This described behavior is really bad because of how often we have to rewrite 
> data onto HDFS.  Imports are usually just IO bound as the RS waits to flush 
> and compact.  In the above example, the first cell to be inserted into this 
> region ends up being written to HDFS 4 times (initial flush, first compaction 
> w/ no split decision, second compaction w/ split decision, third compaction 
> on daughter region).  In addition, we leave a large window where we take on 
> edits (during the second compaction of 320MB) and then must make the region 
> unavailable as we flush it.
> If we increased the compactionThreshold to be 5 and determined splits based 
> on aggregate size, the behavior becomes:
> - We fill up regions, and as long as you are not under global RS heap 
> pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) 
> StoreFiles.
> - After each MemStore flush, we calculate the aggregate size of all 
> StoreFiles.  We can also check the compactionThreshold.  For the first three 
> flushes, both would not hit the limit.  On the fourth flush, we would see 
> total aggregate size = 256MB and determine to make a split.
> - Decision to split is made, region is closed.  This time, the region just 
> has to flush out whichever edits made it to the MemStore during the 
> snapshot/flush of the previous MemStore.  So this time window has shrunk by 
> more than 75% as it was the time to write 64MB from memory not 320MB from 
> aggregating 5 hdfs files.  This will greatly reduce the time data is 
> unavailable during splits.
> - The daughter regions re-open on the same RS.  Immediately when the 
> StoreFiles are opened, a compaction is triggered across all of their 
> StoreFiles because they contain references.  This would stay the same.
> In this example, we only write a given cell twice (instead of 4 times) while 
> drastically reducing data unavailability during splits.  On the original 
> flush, and post-split to remove references.  The other benefit of post-split 
> compaction (which doesn't change) is that we then get good data locality as 
> the resulting StoreFile will be written to the local DataNode.  In another 
> jira, we should deal with opening up one of the daughter regions on a 
> different RS to distribute load better, but that's outside the scope of this 
> one.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to