[ 
https://issues.apache.org/jira/browse/HBASE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881607#comment-13881607
 ] 

Andrew Purtell commented on HBASE-10418:
----------------------------------------

This presumes that smaller HFiles contain the data of interest for short scans. 
What kind of mechanism do we have in place to make that more likely than not?

Would it be better to do a bit of schema design such that small files / short 
scan data is segregated to one column family and the large files / large scan 
data to another, and then prioritize in cache by column family?

> give blocks of smaller store files priority in cache
> ----------------------------------------------------
>
>                 Key: HBASE-10418
>                 URL: https://issues.apache.org/jira/browse/HBASE-10418
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Sergey Shelukhin
>
> This is just an idea at this point; I don't have a patch, nor do I plan to 
> make one in the near future.
> It's especially good for datasets that don't fit in memory, and when scans 
> are involved. 
> Scans (and gets in absence of bloom filters' help) have to read from all 
> store files. A short-range request will hit one block in every file.
> If small files are more likely to be entirely available in memory, on average 
> requests will read fewer blocks from the FS. 
> For scans that read a lot of data, it's better to read blocks in sequence 
> from a big file and blocks for small files from cache, rather than a mix of 
> FS and cached blocks from different files, because the (HBase) blocks of a 
> big file would be sequential in one HDFS block.
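
A minimal sketch of the eviction idea described in the issue, for illustration 
only. It is a toy model and not HBase's block cache: the class 
SizeAwareBlockCache, its Block record, and the method names are all 
hypothetical. It assumes the cache knows the size of the store file each block 
came from, and when over capacity it evicts a block belonging to the largest 
file first, breaking ties by least recent access.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy cache (hypothetical, not an HBase class): prefers keeping blocks of small files. */
class SizeAwareBlockCache {
    /** One cached block, keyed by file name and offset within that file. */
    static final class Block {
        final String file;
        final long fileSize;   // size of the store file the block belongs to
        final long offset;
        long lastAccess;       // logical clock value of the most recent access

        Block(String file, long fileSize, long offset, long lastAccess) {
            this.file = file;
            this.fileSize = fileSize;
            this.offset = offset;
            this.lastAccess = lastAccess;
        }
    }

    private final int capacity;            // maximum number of cached blocks
    private final List<Block> blocks = new ArrayList<>();
    private long clock = 0;

    SizeAwareBlockCache(int capacity) {
        this.capacity = capacity;
    }

    private Block find(String file, long offset) {
        for (Block b : blocks) {
            if (b.file.equals(file) && b.offset == offset) {
                return b;
            }
        }
        return null;
    }

    boolean contains(String file, long offset) {
        return find(file, offset) != null;
    }

    /** Caches a block, or refreshes its access time if it is already cached. */
    void cache(String file, long fileSize, long offset) {
        Block existing = find(file, offset);
        if (existing != null) {
            existing.lastAccess = ++clock;
            return;
        }
        blocks.add(new Block(file, fileSize, offset, ++clock));
        // The new block competes with the others, so a block from a big file
        // may be dropped immediately when the cache is full of small-file blocks.
        while (blocks.size() > capacity) {
            evictOne();
        }
    }

    /** Evicts the block from the largest file; ties go to the least recently used. */
    private void evictOne() {
        Block victim = blocks.get(0);
        for (Block b : blocks) {
            boolean largerFile = b.fileSize > victim.fileSize;
            boolean sameSizeButOlder =
                b.fileSize == victim.fileSize && b.lastAccess < victim.lastAccess;
            if (largerFile || sameSizeButOlder) {
                victim = b;
            }
        }
        blocks.remove(victim);
    }
}
```

With a capacity of two, caching one block of a small file, one block of a big 
file, and then a second block of the small file leaves both small-file blocks 
cached and drops the big-file block. A real policy would also have to weigh 
access frequency and block size in bytes, which this sketch ignores.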



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
