[ https://issues.apache.org/jira/browse/HBASE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881622#comment-13881622 ]
Sergey Shelukhin commented on HBASE-10418:
------------------------------------------

If you have such knowledge, yes. I am talking about an unknown data distribution, within the same table/CF for simplicity.
First, if compactions happen in the normal pattern, we'll have a large file from major compaction and small files from flushes/minors. If we don't know the data distribution, what is described above would be the expected pattern...
Specifically for scans, they cannot use bloom filters, and pretty much have to hit a block of each file, no matter the data distribution, right?

> give blocks of smaller store files priority in cache
> ----------------------------------------------------
>
>                 Key: HBASE-10418
>                 URL: https://issues.apache.org/jira/browse/HBASE-10418
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Sergey Shelukhin
>
> That's just an idea at this point; I don't have a patch, nor do I plan to make one
> in the near future.
> It's especially good for datasets that don't fit in memory, and when scans are
> involved.
> Scans (and gets, in the absence of bloom filters' help) have to read from all
> store files. A short-range request will hit one block in every file.
> If small files are more likely to be entirely available in memory, on average
> requests will hit fewer blocks from the FS.
> For scans that read a lot of data, it's better to read blocks in sequence
> from a big file and blocks for small files from cache, rather than a mix of
> FS and cached blocks from different files, because the (HBase) blocks of a
> big file would be sequential in one HDFS block.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
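
To put rough numbers on the "one block in every file" argument, here is a toy model in Python. It is only a sketch under stated assumptions, not HBase code: the function names, file sizes, and cache capacity are hypothetical, each request is assumed to hit one uniformly random block per store file, and the "proportional" baseline (cache spread across files by size) is an assumption for comparison, not how the existing block cache behaves.

```python
def expected_fs_reads(file_blocks, cached_blocks):
    """Expected FS block reads for one short-range request.

    Assumes the request touches exactly one block in every store file,
    chosen uniformly at random within that file.
    """
    return sum(1.0 - cached / total
               for total, cached in zip(file_blocks, cached_blocks))


def small_files_first(file_blocks, capacity):
    """Hypothetical policy: fill the cache starting from the smallest file."""
    cached = [0] * len(file_blocks)
    remaining = capacity
    for i in sorted(range(len(file_blocks)), key=lambda i: file_blocks[i]):
        take = min(file_blocks[i], remaining)
        cached[i] = take
        remaining -= take
    return cached


def proportional(file_blocks, capacity):
    """Baseline assumption: every file gets the same cached fraction."""
    frac = min(1.0, capacity / sum(file_blocks))
    return [n * frac for n in file_blocks]


# One large file from a major compaction plus three small flush/minor files
# (sizes in HBase blocks; made-up numbers for illustration).
file_blocks = [1000, 20, 10, 10]
capacity = 104  # cache size in blocks, far smaller than the data set

prioritized = expected_fs_reads(file_blocks, small_files_first(file_blocks, capacity))
baseline = expected_fs_reads(file_blocks, proportional(file_blocks, capacity))
print(f"small files first: {prioritized:.3f} FS reads per request")
print(f"proportional:      {baseline:.3f} FS reads per request")
```

With these made-up sizes, the small-files-first policy leaves only the big file's block to be read from the FS on most requests, while the proportional baseline misses in nearly every file. The model ignores access skew, eviction dynamics, and HDFS read-ahead, so it illustrates the argument rather than predicting real behavior.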