[jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2012-02-13 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207063#comment-13207063
 ] 

stack commented on HBASE-2375:
--

+1 on patch

Make a new issue to change compactionThreshold to at least 4, I'd say, and then you 
can close out this one?

 Make decision to split based on aggregate size of all StoreFiles and revisit 
 related config params
 --

 Key: HBASE-2375
 URL: https://issues.apache.org/jira/browse/HBASE-2375
 Project: HBase
 Issue Type: Improvement
 Components: regionserver
 Affects Versions: 0.20.3
 Reporter: Jonathan Gray
 Assignee: Jonathan Gray
 Priority: Critical
 Labels: moved_from_0_20_5
 Attachments: HBASE-2375-flush-split.patch, HBASE-2375-v8.patch


 Currently we will make the decision to split a region when a single StoreFile 
 in a single family exceeds the maximum region size.  This issue is about 
 changing the decision to split to be based on the aggregate size of all 
 StoreFiles in a single family (but still not aggregating across families).  
 This would move a check to split after flushes rather than after compactions. 
  This issue should also deal with revisiting our default values for some 
 related configuration parameters.
 The motivating factor for this change comes from watching the behavior of 
 RegionServers during heavy write scenarios.
 Today the default behavior goes like this:
 - We fill up regions, and as long as you are not under global RS heap 
 pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) 
 StoreFiles.
 - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a 
 compaction on this region.
 - Compaction queues notwithstanding, this will create a 192MB file, not 
 triggering a split based on max region size (hbase.hregion.max.filesize).
 - You'll then flush two more 64MB MemStores and hit the compactionThreshold 
 and trigger a compaction.
 - You end up with 192 + 64 + 64 in a single compaction.  This will create a 
 single 320MB file and will trigger a split.
 - While you are performing the compaction (which now writes out 64MB more 
 than the split size, so takes about 5X as long as a single flush), you are 
 still taking on additional writes into MemStore.
 - Compaction finishes, decision to split is made, region is closed.  The 
 region now has to flush whichever edits made it to MemStore while the 
 compaction ran.  This flushing, in our tests, is by far the dominating factor 
 in how long data is unavailable during a split.  We measured about 1 second 
 to do the region closing, master assignment, reopening.  Flushing could take 
 5-6 seconds, during which time the region is unavailable.
 - The daughter regions re-open on the same RS.  Immediately when the 
 StoreFiles are opened, a compaction is triggered across all of their 
 StoreFiles because they contain references.  Since we cannot currently split 
 a split, we need to not hang on to these references for long.
 This described behavior is really bad because of how often we have to rewrite 
 data onto HDFS.  Imports are usually just IO bound as the RS waits to flush 
 and compact.  In the above example, the first cell to be inserted into this 
 region ends up being written to HDFS 4 times (initial flush, first compaction 
 w/ no split decision, second compaction w/ split decision, third compaction 
 on daughter region).  In addition, we leave a large window where we take on 
 edits (during the second compaction of 320MB) and then must make the region 
 unavailable as we flush it.
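 To make the arithmetic above concrete, here is a minimal, self-contained 
 simulation of the default behavior just described (illustrative only; the 
 class, the loop, and the simplified trigger rules are assumptions for this 
 example, not HBase code). With 64MB flushes, compactionThreshold 3 and a 
 256MB max file size it reproduces the 4-writes count:

{code}
import java.util.ArrayList;
import java.util.List;

// Back-of-the-envelope simulation of the default flush/compact/split behavior
// described above. All names and trigger rules are illustrative.
public class SplitAmplification {
    public static void main(String[] args) {
        final long flushSizeMb = 64;        // hbase.hregion.memstore.flush.size
        final int compactionThreshold = 3;  // hbase.hstore.compactionThreshold
        final long maxFileSizeMb = 256;     // hbase.hregion.max.filesize

        List<Long> storeFiles = new ArrayList<>(); // StoreFile sizes in MB
        int writesOfFirstCell = 0;                 // times the first cell hits HDFS
        boolean split = false;

        while (!split) {
            storeFiles.add(flushSizeMb);                        // MemStore flush
            if (writesOfFirstCell == 0) writesOfFirstCell = 1;  // first flush holds the first cell
            if (storeFiles.size() >= compactionThreshold) {     // compaction on file count
                long merged = storeFiles.stream().mapToLong(Long::longValue).sum();
                storeFiles.clear();
                storeFiles.add(merged);                         // one compacted file
                writesOfFirstCell++;                            // first cell rewritten
                if (merged > maxFileSizeMb) {                   // split decided after compaction
                    split = true;
                    writesOfFirstCell++;                        // daughter-region compaction of references
                }
            }
        }
        // 3 flushes -> 192MB compaction (no split), 2 more flushes -> 320MB compaction -> split.
        System.out.println("first cell written " + writesOfFirstCell + " times"); // prints 4
    }
}
{code}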
 If we increased the compactionThreshold to be 5 and determined splits based 
 on aggregate size, the behavior becomes:
 - We fill up regions, and as long as you are not under global RS heap 
 pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) 
 StoreFiles.
 - After each MemStore flush, we calculate the aggregate size of all 
 StoreFiles.  We can also check the compactionThreshold.  For the first three 
 flushes, neither limit would be hit.  On the fourth flush, we would see 
 total aggregate size = 256MB and determine to split.
 - Decision to split is made, region is closed.  This time, the region just 
 has to flush out whichever edits made it to the MemStore during the 
 snapshot/flush of the previous MemStore.  So this time window has shrunk by 
 more than 75%, as it is now the time to write 64MB from memory rather than 
 320MB from aggregating 5 HDFS files.  This will greatly reduce the time data 
 is unavailable during splits.
 - The daughter regions re-open on the same RS.  Immediately when the 
 StoreFiles are opened, a compaction is triggered across all of their 
 StoreFiles because they contain references.  This would stay the same.
 In this example, we only write a given cell twice (instead of 4 times) while 
 drastically
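 For comparison, a sketch of the proposed check (a hedged, stand-alone helper, 
 not the actual HBase implementation; the class and method names are made up): 
 after every MemStore flush, sum the StoreFile sizes per family and decide to 
 split once the largest family's aggregate reaches hbase.hregion.max.filesize.

{code}
import java.util.List;

// Illustrative helper for the proposed aggregate-size split check.
public final class AggregateSplitCheck {

    // Aggregate size (bytes) of one family's StoreFiles.
    static long aggregateSize(List<Long> storeFileSizes) {
        return storeFileSizes.stream().mapToLong(Long::longValue).sum();
    }

    // Split when the largest family's aggregate reaches the max file size;
    // still no aggregation across families.
    static boolean shouldSplit(List<List<Long>> sizesPerFamily, long maxFileSize) {
        return sizesPerFamily.stream()
                .mapToLong(AggregateSplitCheck::aggregateSize)
                .max()
                .orElse(0L) >= maxFileSize;
    }

    public static void main(String[] args) {
        long maxFileSize = 256L << 20; // 256MB, matching the example above
        long flush = 64L << 20;        // four 64MB flushes, no compaction yet
        List<List<Long>> oneFamily = List.of(List.of(flush, flush, flush, flush));
        System.out.println(shouldSplit(oneFamily, maxFileSize)); // true on the fourth flush
    }
}
{code}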

[jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2012-02-09 Thread Jean-Daniel Cryans (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204701#comment-13204701
 ] 

Jean-Daniel Cryans commented on HBASE-2375:
---

A bunch of things changed since this jira was created: 
 - we now split based on the store size 
 - regions split at 1GB 
 - memstores flush at 128MB 
 - there's been a lot of work on tuning the store file selection algorithm 

My understanding of this jira is that it aims at making the out-of-the-box 
mass import experience better. Now that we have bulk loads and pre-splitting, 
this use case is becoming less and less important... although we still see 
people trying to benchmark it (hi hypertable). 

I see three things we could do:
 - Trigger splits after flushes; I hacked a patch and it works awesomely
 - Have a lower split size for newly created tables. Hypertable does this with 
a soft limit that gets doubled every time the table splits until it reaches the 
normal split size (see the sketch at the end of this comment)
 - Have multi-way splits (Todd's idea), so that if you have enough data that 
you know you're going to be splitting after the current split, then just spawn 
as many daughters as you need.

I'm planning on just fixing the first bullet point in the context of this 
jira. Maybe there's other stuff from the patch in this jira that we could fit 
in.
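
For the second bullet, a minimal sketch of the soft-limit idea (names and 
numbers are illustrative, not an existing HBase or Hypertable API): the 
effective split size starts low and doubles on every split until it reaches 
the normal configured split size.

{code}
// Illustrative only: effective split size that doubles per split.
public final class DoublingSplitSize {

    static long effectiveSplitSize(long initialSoftLimit, long configuredSplitSize, int splitsSoFar) {
        long size = initialSoftLimit;
        for (int i = 0; i < splitsSoFar && size < configuredSplitSize; i++) {
            size *= 2; // double on each split until the normal split size is reached
        }
        return Math.min(size, configuredSplitSize);
    }

    public static void main(String[] args) {
        long soft = 128L << 20;    // starting soft limit, e.g. 128MB (illustrative)
        long normal = 1024L << 20; // normal split size, e.g. 1GB
        for (int splits = 0; splits <= 4; splits++) {
            System.out.println(splits + " splits -> "
                    + (effectiveSplitSize(soft, normal, splits) >> 20) + "MB");
        }
        // 0 -> 128MB, 1 -> 256MB, 2 -> 512MB, 3 -> 1024MB, 4 -> 1024MB
    }
}
{code}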


[jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2012-02-09 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204779#comment-13204779
 ] 

stack commented on HBASE-2375:
--

Doing the first bullet point only sounds good.  Let's file issues for the 
other split suggestions.

What about the other recommendations made up in the issue regarding 
compactionThreshold?

Upping compactionThreshold from 3 to 5 where 5 is >= the number of flushes 
it would take to make us splittable; i.e. the intent is no compaction before 
the first split.

Should we do this too as part of this issue?  We could make our flush size 256M 
and compactionThreshold 5.  Or perhaps that's too rad (that's a big Map to be 
carrying around)?  Instead, up the compactionThreshold and down the default 
region size from 1G to 512M and keep flush at 128M?
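
Rough arithmetic behind those options, as a tiny illustrative helper (numbers 
only, nothing HBase-specific): the number of flushes it takes to reach the 
split size is the value the compactionThreshold would have to meet or exceed 
for no compaction to happen before the first split.

{code}
// Illustrative: flushes needed to reach the split size (ceiling division).
public final class FlushesToSplit {

    static long flushesToSplit(long splitSizeMb, long flushSizeMb) {
        return (splitSizeMb + flushSizeMb - 1) / flushSizeMb;
    }

    public static void main(String[] args) {
        System.out.println(flushesToSplit(1024, 256)); // 256M flush, 1G split size   -> 4
        System.out.println(flushesToSplit(512, 128));  // 128M flush, 512M split size -> 4
        System.out.println(flushesToSplit(1024, 128)); // 128M flush, 1G split size   -> 8
    }
}
{code}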

I took a look at the patch and it's pretty stale now given the changes that 
have gone in since.




[jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2012-02-09 Thread Jean-Daniel Cryans (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204797#comment-13204797
 ] 

Jean-Daniel Cryans commented on HBASE-2375:
---

bq. Doing the first bullet point only sounds good. Let's file issues for the 
other split suggestions.

Kewl.

bq. Upping compactionThreshold from 3 to 5 where 5 is >= the number of 
flushes it would take to make us splittable; i.e. the intent is no compaction 
before the first split.

Sounds like a change that can have a bigger impact but that mostly helps this 
specific use case...

bq. Instead, up the compactionThreshold and down the default region size from 
1G to 512M and keep flush at 128M?

I'd rather split earlier for the first regions.


[jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2012-02-09 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204820#comment-13204820
 ] 

stack commented on HBASE-2375:
--

bq. Upping compactionThreshold from 3 to 5... Sounds like a change that can 
have a bigger impact but that mostly helps this specific use case...

Dunno.  3 strikes me as one of those decisions that made sense a long time ago 
but a bunch has changed since...  We should test it, I suppose.

On changing flush/region size, you'd rather have us split faster at first, then 
slow down as the count of regions goes up.  Ok.


[jira] Commented: (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2010-08-24 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902178#action_12902178
 ] 

Jean-Daniel Cryans commented on HBASE-2375:
---

Do we have enough time to do this if 0.90.0 is due for HW? Punt?


[jira] Commented: (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

2010-06-01 Thread Jeff Whiting (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874199#action_12874199
 ] 

Jeff Whiting commented on HBASE-2375:
-

The optimizations here look great.  It seems like an additional optimization 
could still be made.  Looking at the patch, there doesn't seem to be any 
prioritization of compaction requests.  So if you have a region server that is 
in charge of a large number of regions, the compaction queue can still get quite 
large and prevent more important compactions from happening in a timely manner. 
I implemented a priority queue for compactions that may make a lot of sense 
to include with these optimizations (see HBASE-2646).  
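
As a rough illustration of that idea (this is not the HBASE-2646 patch; the 
Request type and the priorities are stand-ins, not HBase classes), queued 
compaction requests could be ordered by priority instead of FIFO:

{code}
import java.util.concurrent.PriorityBlockingQueue;

// Illustrative priority-ordered compaction queue.
public final class PriorityCompactionQueueExample {

    // Stand-in request type; higher priority values are served first.
    record Request(String region, int priority) implements Comparable<Request> {
        @Override public int compareTo(Request other) {
            return Integer.compare(other.priority(), this.priority());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<>();
        queue.put(new Request("region-a", 1));  // routine compaction
        queue.put(new Request("region-b", 10)); // store far over its file-count threshold
        queue.put(new Request("region-c", 5));

        while (!queue.isEmpty()) {
            System.out.println(queue.take().region()); // region-b, region-c, region-a
        }
    }
}
{code}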
