[ https://issues.apache.org/jira/browse/HBASE-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
clockfly updated HBASE-7885:
----------------------------
    Attachment: hbase_bloom_shrink_fix.patch

> bloom filter compaction is too aggressive for Hfile which only contains small count of records
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7885
>                 URL: https://issues.apache.org/jira/browse/HBASE-7885
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance, Scanners
>    Affects Versions: 0.94.5
>            Reporter: clockfly
>            Priority: Minor
>             Fix For: 0.94.5
>
>         Attachments: hbase_bloom_shrink_fix.patch
>
>
> For HFile V2, the bloom filter starts with an initial size of 128 KB.
> When only a few records have been inserted, the bloom filter shrinks
> itself (compaction) by repeatedly halving its size.
> For example, a 128K filter will compact to 64K
> ->32K->16K->8K->4K->2K->1K->512->256->128->64->32, as long as it estimates
> that the result still stays within the expected error rate.
> If we put only a few records in the HFile, the bloom filter is compacted
> to a size that is too small, which breaks the assumption that shrinking
> stays bounded by the estimated error rate. The false positive rate
> becomes unacceptably high.
> For example, if we set the expected error rate to 0.00001, then for 10
> records the bloom filter will be 64 bytes after compaction. The real
> effective false positive rate will be 50%.
> The use case is this: if we use HBase to store big records such as images
> and binaries, each record takes megabytes, so a 128 MB file contains only
> dozens of records.
> The suggested fix is to set a lower limit for the bloom filter compaction
> process. I suggest using 1000 bytes.
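
The halving sequence and the proposed lower limit can be illustrated with a small sketch. This is not HBase's code and not the attached patch: the function `shrink_sizes` and its parameters `required_bytes` and `min_bytes` are hypothetical names chosen for illustration, and `required_bytes` merely stands in for whatever size the filter's own error-rate estimate considers sufficient.

```python
def shrink_sizes(initial_bytes, required_bytes, min_bytes=0):
    """Return every size visited while repeatedly halving a bloom filter.

    Hypothetical model: halving continues while the halved size is still
    at least required_bytes (the size the error-rate estimate accepts)
    and at least min_bytes (the proposed lower limit).
    """
    sizes = [initial_bytes]
    size = initial_bytes
    while size % 2 == 0:
        half = size // 2
        if half < required_bytes or half < min_bytes:
            break  # halving again would violate one of the two bounds
        sizes.append(half)
        size = half
    return sizes


# Without a lower limit, the filter shrinks to the 64 bytes cited in the report.
unbounded = shrink_sizes(128 * 1024, required_bytes=64)

# With a 1000-byte lower limit, halving stops at the last power of two above it.
bounded = shrink_sizes(128 * 1024, required_bytes=64, min_bytes=1000)

print(unbounded[-1], bounded[-1])
```

Because the filter only ever halves from 128 KB, a 1000-byte limit in this model means the smallest size reached is 1024 bytes rather than exactly 1000.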