[jira] [Commented] (HBASE-7667) Support stripe compaction

Sergey Shelukhin (JIRA) Mon, 25 Mar 2013 11:35:18 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612956#comment-13612956
 ]


Sergey Shelukhin commented on HBASE-7667:
-----------------------------------------

bq. The first half of the sentence seems to be incomplete.
bq. I think I understand what you mean - such KVs would be written by previous 
writer.
bq. Missing author, date, and JIRA pointer?
bq. I think you need to say stripe == sub-range of the region key range.  You 
almost do.  Just do it explicitly.
bq. What does this mean "and old boundaries rarely, if ever, moving."?  Give 
doc an edit?
bq. Say in doc that you mean storefile metadata else it is ambiguous.
bq. Not sure I follow here: "This compaction is performed when the number of 
L0 files exceeds some threshold and produces the number of files equivalent to 
the number of stripes, with enforced existing boundaries."
Fixed these.

bq. An interesting comment by LarsH recently was that maybe we should ship w 
/major compactions off; most folks don't delete
Hmm... in general I agree but we'll have to insert really good warnings 
everywhere. Can we detect if they delete? :)

bq. Missing is one a pointer at least to how it currently works (could just 
point at src file I'd say with its description of 'sigma' compactions) and a 
sentence on whats wrong w/ it

bq. Later I suppose we could have a combination of count-based and 
size-based.... if an edge stripe is N time bigger than any other, add a new 
stripe?
Yeah, it's mentioned in a code comment somewhere.
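
To make the size-based idea in the quoted suggestion concrete, here is a minimal sketch of such a check. It is not taken from the patch: the class name, the method names and the reading of "N times bigger than any other" (the edge stripe must be at least N times the size of every other stripe) are all assumptions made for illustration.

```java
// Hypothetical heuristic, not HBase code: decide whether an edge stripe has
// grown so large relative to all other stripes that a new stripe is warranted.
final class EdgeStripeSketch {

    // Returns true if the first or the last stripe is at least `factor` times
    // as large as every other stripe. Sizes are in bytes (assumed).
    static boolean edgeStripeTooBig(long[] stripeSizes, double factor) {
        if (stripeSizes.length < 2) {
            return false; // nothing to compare against
        }
        int last = stripeSizes.length - 1;
        return exceedsAllOthers(stripeSizes, 0, factor)
            || exceedsAllOthers(stripeSizes, last, factor);
    }

    private static boolean exceedsAllOthers(long[] sizes, int edge, double factor) {
        for (int i = 0; i < sizes.length; i++) {
            if (i != edge && sizes[edge] < factor * sizes[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The last stripe (60) is more than 4x every other stripe.
        System.out.println(edgeStripeTooBig(new long[] {10, 12, 11, 60}, 4.0)); // true
        // No edge stripe dominates here.
        System.out.println(edgeStripeTooBig(new long[] {10, 12, 11, 30}, 4.0)); // false
    }
}
```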

bq. I was wondering if you could make use of liang xie's bit of code for making 
keys for the block cache where he chooses a byte sequence that falls between 
the last key in the former block and the first in the next block but the key is 
shorter than either..... but it doesn't make sense here I believe; your 
boundaries have to be hard actual keys given inserts are always coming in.... 
so nevermind this suggestion.
For boundary determination it does make sense; can you point at the code? After 
a cursory look I cannot find it.

bq. You write the stripe info to the storefile.  I suppose it is up to the 
hosting region whether or not it chooses to respect those boundaries.  It could 
ignore them and just respect the seqnum and we'd have the old-style storefile 
handling, right?  (Oh, I see you allow for this -- good)
Yes.


bq. Thinking on L0 again, as has been discussed, we could have flushes skip L0 
and flush instead to stripes (one flush turns into N files, one per stripe) but 
even if we had this optimization, it looks like we'd still want the L0 option 
if only for bulk loaded files or for files whose metadata makes no sense to the 
current region context. "• The aggregate range of files going in must be 
contiguous..." Not sure I follow.  Hmm... could do with ".... going into a 
compaction"
Yes, that was my thinking too.

bq. "If the stripe boundaries are changed by compaction, the entire stripes 
with old boundaries must be replaced" ...What would bring this on? And then how 
would old boundaries get redone?  This one is a bit confusing.
Clarified. Basically one cannot have 3 files in (-inf, 3) and 3 in [3, inf), 
then take 3 and 2 respectively, and rewrite them with boundary 4, because then 
there will be a file with [3, inf) remaining that overlaps.
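
The example above can be checked mechanically. The following is a minimal sketch, not code from the patch: row keys are modeled as `long` values and each file as a half-open range [start, end). The names `Range` and `rewriteIsSafe` are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative model only; real HBase keys are byte arrays, not longs.
final class StripeBoundarySketch {
    static final long NEG_INF = Long.MIN_VALUE;
    static final long POS_INF = Long.MAX_VALUE;

    static final class Range {
        final long start; // inclusive
        final long end;   // exclusive
        Range(long start, long end) { this.start = start; this.end = end; }
        boolean overlaps(Range o) { return start < o.end && o.start < end; }
        boolean sameAs(Range o) { return start == o.start && end == o.end; }
    }

    // A rewrite is safe only if every file left behind either has exactly the
    // same boundaries as an output file or does not overlap that output at all.
    static boolean rewriteIsSafe(List<Range> remaining, List<Range> outputs) {
        for (Range r : remaining) {
            for (Range out : outputs) {
                if (r.overlaps(out) && !r.sameAs(out)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // New stripes after moving the boundary from 3 to 4.
        List<Range> outputs = Arrays.asList(
            new Range(NEG_INF, 4), new Range(4, POS_INF));
        // Taking 3 + 2 files leaves one old [3, inf) file behind.
        List<Range> leftover = Collections.singletonList(new Range(3, POS_INF));
        System.out.println(rewriteIsSafe(leftover, outputs)); // false
        // Taking all files of both stripes leaves nothing behind.
        System.out.println(rewriteIsSafe(Collections.<Range>emptyList(), outputs)); // true
    }
}
```

In this model, the leftover [3, inf) file straddles the new boundary at 4, which is why the whole of both old stripes has to be replaced when a boundary moves.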



bq. I was going to suggest an optimization for later for the case that an L0 
fits fully inside a stripe, I was thinking you could just 'move' it into its 
respective stripe... but I suppose you can't do that because you need to write 
the metadata to put a file into a stripe...
Yeah. Also wouldn't expect it to be a common case.

bq. Would it help naming files for the stripe they belong too?  Would that 
help?  In other words do NOT write stripe data to the storefiles and just let 
the region in memory figure which stripe a file belongs too.  When we write, we 
write with say a L0 suffix.  When compacting we add S1, S2, etc suffix for 
stripe1, etc.  To figure what the boundaries of an S0 are, it'd be something 
the region knew.  On open of the store files, it could use the start and end 
keys that are currently in the file metadata to figure which stripe they fit in.
bq. Would be a bit looser.  Would allow moving a file between stripes with a 
rename only. The delete dropping section looks right.  I like the major 
compaction along a stripe only option. 
This could be done as a future improvement. The implications of changing the 
naming scheme for other parts of the system need to be determined.
Also, for all I know it might break snapshots (moving files does). And the code 
to figure out stripes on the fly would be more complex.
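
For reference, the core of the "figure out stripes on the fly" lookup could look roughly like the sketch below. It is an assumption-laden illustration, not the patch: keys are plain `String`s instead of byte arrays, `boundaries` is assumed to hold the sorted start keys of every stripe except the first, and a file spanning more than one stripe is reported as -1 (that is, treated like an L0 file). The extra complexity mentioned above would come from everything around this lookup, which the sketch does not cover.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not HBase code.
final class StripeLookupSketch {

    // Index of the stripe containing `key`: the number of boundaries <= key.
    static int stripeForKey(String key, List<String> boundaries) {
        int idx = Collections.binarySearch(boundaries, key);
        // Exact hit: key is the start key of stripe idx + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Stripe that fully contains [firstKey, lastKey], or -1 if the file
    // crosses a stripe boundary.
    static int stripeForFile(String firstKey, String lastKey, List<String> boundaries) {
        int first = stripeForKey(firstKey, boundaries);
        int last = stripeForKey(lastKey, boundaries);
        return first == last ? first : -1;
    }

    public static void main(String[] args) {
        // Three stripes: (-inf, "g"), ["g", "p"), ["p", +inf)
        List<String> boundaries = Arrays.asList("g", "p");
        System.out.println(stripeForFile("a", "f", boundaries)); // 0
        System.out.println(stripeForFile("g", "o", boundaries)); // 1
        System.out.println(stripeForFile("f", "h", boundaries)); // -1
    }
}
```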

bq. "For empty ranges, empty files are created."  Is this necessary?  Would be 
good to avoid doing this.
Let me think about this... 

bq. The total i/o in terms of i/o bandwidth consumed is the same. But the disk 
iops are much, much worse. And disk iops are at a premium, and "bg activity" 
like compactions should consume as few as possible.
bq. Let's say we split a region into a 100 sub-regions, such that each 
sub-region is in the few 10's of MB. If the data is written uniformly randomly, 
each sub-region will write out a store at approx the same time. That is, a RS 
will write 100x more files into HDFS (100x more random i/o on the local 
file-system). Next, all sub-regions will do a compaction at almost the same 
time, which is again 100x more read iops to read the old stores for merging.
The memstore for the region stays unified... it may indeed be written out to 
multiple files in the future.

bq. One can try to stagger the compactions to avoid the sudden burst by 
incorporating, say, a queue of to-be-compacted-subregions. But while the 
sub-regions at the head of the queue will compact "in time", the ones at the 
end of the queue will have many more store files to merge, and will use much 
more than their "fair-share" of iops (not to mention that the 
read-amplification in these sub-regions will be higher too). The iops profile 
will be worse than just 100x.
In the current implementation the region is limited to one compaction at a 
time, mostly for simplicity's sake. Yes, if all stripes compact at the same 
time under the uniform scheme, all the improvement will disappear; this will 
have to be controlled if the ability to do so is added.

bq. You have a point that we will be making more files in the fs.
Yeah, that is inevitable.
I heard from someone from Accumulo that they have tons of files open without 
any problems... it may make sense to investigate whether we have problems.


                
> Support stripe compaction
> -------------------------
>
>                 Key: HBASE-7667
>                 URL: https://issues.apache.org/jira/browse/HBASE-7667
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: Stripe compactions.pdf
>
>
> So I was thinking about having many regions as the way to make compactions 
> more manageable, and writing the level db doc about how level db range 
> overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, 
> Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication 
> factor.
> And I suggest the following idea, let's call it stripe compactions. It's a 
> mix between level db ideas and having many small regions.
> It allows us to have a subset of benefits of many regions (wrt reads and 
> compactions) without many of the drawbacks (managing and current 
> memstore/etc. limitation).
> It also doesn't break seqNum-based file sorting for any one key.
> It works like this.
> The region key space is separated into configurable number of fixed-boundary 
> stripes (determined the first time we stripe the data, see below).
> All the data from memstores is written to normal files with all keys present 
> (not striped), similar to L0 in LevelDb, or current files.
> Compaction policy does 3 types of compactions.
> First is L0 compaction, which takes all L0 files and breaks them down by 
> stripe. It may be optimized by adding more small files from different 
> stripes, but the main logical outcome is that there are no more L0 files and 
> all data is striped.
> Second is exactly similar to current compaction, but compacting one single 
> stripe. In future, nothing prevents us from applying compaction rules and 
> compacting part of the stripe (e.g. similar to current policy with ratios 
> and stuff, tiers, whatever), but for the first cut I'd argue let it "major 
> compact" the entire stripe. Or just have the ratio and no more complexity.
> Finally, the third addresses the concern of the fixed boundaries causing 
> stripes to be very unbalanced.
> It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the 
> results out with different boundaries.
> There's a tradeoff here - if we always take 2 adjacent stripes, compactions 
> will be smaller but rebalancing will take a ridiculous amount of I/O.
> If we take many stripes we are essentially getting into the 
> epic-major-compaction problem again. Some heuristics will have to be in place.
> In general, if, before stripes are determined, we initially let L0 grow 
> before determining the stripes, we will get better boundaries.
> Also, unless unbalancing is really large we don't need to rebalance really.
> Obviously this scheme (as well as level) is not applicable for all scenarios, 
> e.g. if timestamp is your key it completely falls apart.
> The end result:
> - many small compactions that can be spread out in time.
> - reads still read from a small number of files (one stripe + L0).
> - region splits become marvelously simple (if we could move files between 
> regions, no references would be needed).
> Main advantage over Level (for HBase) is that default store can still open 
> the files and get correct results - there are no range overlap shenanigans.
> It also needs no metadata, although we may record some for convenience.
> It also would appear to not cause as much I/O.
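
The L0 compaction described in the issue text above (take the unstriped data and break it down by stripe) can be sketched as a single pass over the sorted keys. This is only an illustration under stated assumptions: keys are `String`s rather than byte arrays, each output "file" is an in-memory list, and `boundaries` holds the sorted start keys of every stripe except the first. None of the names come from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not HBase code.
final class L0SplitSketch {

    // Splits already-sorted keys into one output list per stripe. A stripe
    // with no keys still gets an (empty) list in this sketch.
    static List<List<String>> splitByStripe(List<String> sortedKeys, List<String> boundaries) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i <= boundaries.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        int stripe = 0;
        for (String key : sortedKeys) {
            // Advance to the stripe whose range contains this key; because
            // the input is sorted, the stripe index never moves backwards.
            while (stripe < boundaries.size()
                    && key.compareTo(boundaries.get(stripe)) >= 0) {
                stripe++;
            }
            outputs.get(stripe).add(key);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "c", "h", "q", "z");
        List<String> boundaries = Arrays.asList("g", "p");
        System.out.println(splitByStripe(keys, boundaries)); // [[a, c], [h], [q, z]]
    }
}
```

Because every key lands in exactly one stripe and the per-key order is untouched, a split like this is consistent with the claim that seqNum-based file sorting is preserved for any one key.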

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
