[ 
https://issues.apache.org/jira/browse/HBASE-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858713#action_12858713
 ] 

Todd Lipcon commented on HBASE-2462:
------------------------------------

We talked about this a bit at the hackathon today.

One idea about the heuristic is to measure the actual cost of having multiple 
storefiles for a region (this was discussed a bit in HBASE-2457). The overall 
cost of having a lot of files is the cost of hitting HFiles for reads. We can 
easily measure this - whenever we access a store for a read/scan, we should 
increment a counter for that store based on how much time we spent accessing 
it. We can use this data in a number of ways:
- When deciding which files to compact, we know the "cost" of each file - if a 
file has a large cost, then including it in the compaction is worth a lot. If 
it has a small cost, we won't gain much by compacting it. We can weigh the cost 
vs the size of the file - if it has been costing us very little, but it's a big 
file, it's not worth compacting.
- We can also divide the sum of the costs by the number of reads - this gives 
us an "effective number of store files". For example, if we have 10 store 
files, but 5 of them are completely in block cache, then we effectively only 
have 5 store files from the standpoint of the benefit of a compaction. We can 
use this to prioritize compactions that will actually be helpful.
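
To make the two ideas above concrete, here is a rough sketch in Java. It is not HBase code: StoreReadCostTracker and every method on it are hypothetical names made up for illustration. It also assumes that "cost" is wall-clock nanoseconds spent in a file, and that the "effective number of store files" is obtained by normalizing the average cost per read by the cost of one uncached file access (a parameter the caller would have to estimate).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-store read-cost accounting; not an HBase API. */
class StoreReadCostTracker {
    // Store file name -> total nanoseconds spent reading that file.
    private final Map<String, Long> costNanos = new HashMap<>();
    // Store file name -> on-disk size in bytes.
    private final Map<String, Long> sizeBytes = new HashMap<>();
    // Number of read/scan operations served by this store.
    private long reads;

    void addFile(String name, long bytes) {
        sizeBytes.put(name, bytes);
        costNanos.putIfAbsent(name, 0L);
    }

    /** Call once per read/scan against the store. */
    void recordRead() {
        reads++;
    }

    /** Call for each file touched by a read, with the time spent in it. */
    void recordFileAccess(String name, long nanos) {
        costNanos.merge(name, nanos, Long::sum);
    }

    /** Read cost weighed against size: a cheap-to-read big file scores low. */
    double costPerByte(String name) {
        long size = Math.max(1L, sizeBytes.getOrDefault(name, 1L));
        return costNanos.getOrDefault(name, 0L) / (double) size;
    }

    /** Files ordered from most to least worth including in a compaction. */
    List<String> rankForCompaction() {
        List<String> names = new ArrayList<>(sizeBytes.keySet());
        names.sort((a, b) -> Double.compare(costPerByte(b), costPerByte(a)));
        return names;
    }

    /**
     * Sum of costs divided by number of reads, expressed in units of one
     * uncached file access (assumed normalization, supplied by the caller).
     */
    double effectiveStoreFiles(long nanosPerUncachedAccess) {
        if (reads == 0 || nanosPerUncachedAccess <= 0) {
            return 0.0;
        }
        long total = 0;
        for (long c : costNanos.values()) {
            total += c;
        }
        return total / (reads * (double) nanosPerUncachedAccess);
    }
}
```

With 10 files where 5 are served from block cache at roughly zero cost and the other 5 each cost one uncached access per read, effectiveStoreFiles comes out at 5, matching the example above. A real implementation would need thread-safe counters and some decay of old measurements, neither of which is shown here.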



> Review compaction heuristic and move compaction code out so standalone and 
> independently testable
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2462
>                 URL: https://issues.apache.org/jira/browse/HBASE-2462
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.20.5, 0.21.0
>
>
> Anything that improves our i/o profile makes hbase run smoother.  Over in 
> HBASE-2457, good work has been done already describing the tension between 
> minimizing compactions versus minimizing count of store files.  This issue is 
> about following on from what has been done in 2457, but also about breaking 
> the hard-to-read compaction code out of Store.java into a standalone class 
> that can be more easily tested (and easily analyzed for its performance 
> characteristics).
> If possible, in the refactor, we'd allow specification of alternate merge 
> sort implementations. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
