[ 
https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Rodionov updated HBASE-14477:
--------------------------------------
    Summary: Compaction improvements: Generational compaction policy  (was: 
Compaction improvements: generational compaction)

> Compaction improvements: Generational compaction policy
> -------------------------------------------------------
>
>                 Key: HBASE-14477
>                 URL: https://issues.apache.org/jira/browse/HBASE-14477
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
>
>
> For immutable and mostly immutable data, the current SizeTiered-based 
> compaction policy is not efficient: 
> # There is no need to compact all files into one, because the data is (mostly) 
> immutable and there is no garbage to collect. (Performance reasons are 
> discussed later.)
> # Size-tiered compaction is not suitable for applications where the most recent 
> data is the most important, because it prevents efficient caching of that data. 
> The idea of the generational compaction policy is similar to 
> DateTieredCompaction in Cassandra:
> # Memstore flushes create files of Gen0.
> # Only store files of the same generation can be compacted together. 
> # Once the number of files in GenK reaches N (default: 5), they are compacted 
> into one file of Gen(K+1).
> # Compaction stops at a predefined generation M (default: 3).
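
The four rules above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the HBase implementation: `simulate_flushes` and its parameter names are hypothetical, every flush is assumed to produce a file of exactly `flush_mb`, and the output of a compaction is assumed to be the sum of its inputs (no deletes or overwrites).

```python
from collections import defaultdict


def simulate_flushes(num_flushes, flush_mb=30, files_per_compaction=5, max_generation=3):
    """Return {generation: [file sizes in MB]} after num_flushes memstore flushes.

    Hypothetical model of the policy described in the issue; all names are
    illustrative and do not correspond to HBase classes or settings.
    """
    generations = defaultdict(list)
    for _ in range(num_flushes):
        # Rule 1: a memstore flush creates a file of Gen0.
        generations[0].append(flush_mb)
        gen = 0
        # Rules 2-4: only files of the same generation are compacted together,
        # once there are N of them, and generation M is never compacted.
        while gen < max_generation and len(generations[gen]) >= files_per_compaction:
            merged_mb = sum(generations[gen])  # assumption: immutable data, sizes add up
            generations[gen] = []
            generations[gen + 1].append(merged_mb)
            gen += 1  # the new file may in turn fill up the next generation
    # Drop empty generations so the result only lists files that exist.
    return {g: list(files) for g, files in sorted(generations.items()) if files}


if __name__ == "__main__":
    for flushes in (4, 5, 25, 125):
        print(flushes, simulate_flushes(flushes))
```

For example, `simulate_flushes(64, files_per_compaction=4)` leaves a single 1920MB file in Gen3, which corresponds to the 4x multiplier used in the size estimate in the issue text.
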
> Simple math. For the sake of simplicity, let us say that the flush size is 30MB.
> Gen0: 4*30MB = 120MB 
> Gen1: 4*120MB = 480MB
> Gen2: 4*480MB = 1.92GB
> Gen3: R * 1.92GB (Gen3 is not compacted by default)
> With 3-4 files in Gen3 we get a total region size of 10-12GB, of which 10-20% 
> (Gen0, Gen1 and most of Gen2) can be kept in the block cache.
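
The estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the figures in the issue do, that each compacted generation holds 4 files and that one Gen3 file is as large as a full Gen2; `region_size_mb` and its parameters are hypothetical names, and 1GB is taken as 1000MB.

```python
def region_size_mb(gen3_files, flush_mb=30, files_per_gen=4, compacted_gens=3):
    """Estimate region size in MB with Gen0..Gen2 full and gen3_files files in Gen3.

    Illustrative only: the defaults mirror the numbers in the issue text,
    not any HBase configuration.
    """
    file_mb = flush_mb  # size of one file in the current generation
    total_mb = 0
    for _ in range(compacted_gens):
        gen_mb = files_per_gen * file_mb  # e.g. Gen0: 4 * 30MB = 120MB
        total_mb += gen_mb
        file_mb = gen_mb  # one file of the next generation is this generation compacted
    # After the loop, file_mb is the size of one Gen3 file (1920MB by default).
    return total_mb + gen3_files * file_mb


if __name__ == "__main__":
    for r in (3, 4, 5):
        print(f"{r} Gen3 files -> {region_size_mb(r) / 1000:.2f} GB")
```

Under these assumptions Gen0 through Gen2 add up to about 2.5GB, and a region with 4 files in Gen3 comes to roughly 10.2GB.
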
> Generational compaction does not limit region size: one can use 100GB or even 
> more, because the total compaction IO per region can be bounded and, generally 
> speaking, does not depend explicitly on region size (as it does in the Size 
> Tiered compaction policy).
> Now, about performance implications:
> SSD-based servers will benefit from this policy because they provide more than 
> adequate random IO, but even HDD-based systems can use it. Again, 
> simple math: with a region size of ~10GB we will have ~16 files, of which 10-12 
> can be cached in the block cache. Even if a request touches all the files (spans 
> the entire time range), it will need to read only the remaining 4-6 files from 
> disk. How to always keep recent data in the block cache is a totally separate 
> topic (JIRA). 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
