[
https://issues.apache.org/jira/browse/HBASE-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-30174:
-----------------------------------
Labels: pull-request-available (was: )
> Add start offset option to ROWPREFIX_FIXED_LENGTH bloom filter
> --------------------------------------------------------------
>
> Key: HBASE-30174
> URL: https://issues.apache.org/jira/browse/HBASE-30174
> Project: HBase
> Issue Type: Task
> Components: master
> Reporter: JinHyuk Kim
> Assignee: JinHyuk Kim
> Priority: Minor
> Labels: pull-request-available
>
> h2. Problem
> The {{ROWPREFIX_FIXED_LENGTH}} bloom filter always hashes the prefix starting
> from the beginning of the row key. This works well in many cases, but there
> are also schemas where the leading bytes contain low-value or repetitive data
> such as a fixed salt or bucket id.
> For example, row keys like:
> {code:java}
> {salt}:{id1}:{id2}
> {code}
> may benefit more from building the bloom filter on {{id1}} rather than the
> leading salt bytes.
> In those cases, hashing from offset 0 reduces the effectiveness of the bloom
> filter because part of the bloom key space is consumed by bytes that do not
> meaningfully help distinguish HFiles.
> h2. Suggestion
> Introduce a new optional configuration:
> {code:java}
> RowPrefixBloomFilter.prefix_start_offset
> {code}
> This option makes the bloom filter skip a configurable number of leading bytes
> before extracting the fixed-length prefix used for hashing. It defaults to {{0}},
> which keeps the current behavior of hashing from the start of the row key.
> The goal is to support rowkey layouts where the meaningful lookup prefix does
> not start at byte {{0}}.
> h2. Usage
> {code:java}
> create 'test', {
>   NAME => 'cf',
>   BLOOMFILTER => 'ROWPREFIX_FIXED_LENGTH',
>   CONFIGURATION => {
>     'RowPrefixBloomFilter.prefix_length' => '8',
>     'RowPrefixBloomFilter.prefix_start_offset' => '4'
>   }
> }
> {code}
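For illustration only, the extraction described under "Suggestion" could look roughly like the sketch below. It is not the code from the pull request. The class name PrefixOffsetSketch, the method bloomKey, and the sample row key are hypothetical. The issue also does not say what happens when a row key is shorter than offset plus prefix length, so returning null in that case is purely an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not HBase code: shows which bytes of a row key
// would feed the bloom filter for a given start offset and prefix length.
class PrefixOffsetSketch {

    /**
     * Returns the slice row[startOffset, startOffset + prefixLength).
     * Assumption: rows too short to contain the whole slice yield null
     * (the issue does not specify this case).
     */
    static byte[] bloomKey(byte[] row, int startOffset, int prefixLength) {
        if (startOffset < 0 || prefixLength <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        if (row.length < startOffset + prefixLength) {
            return null;
        }
        return Arrays.copyOfRange(row, startOffset, startOffset + prefixLength);
    }

    public static void main(String[] args) {
        // Made-up key in the {salt}:{id1}:{id2} layout from the issue.
        byte[] row = "0042:user0001:evt9".getBytes(StandardCharsets.US_ASCII);

        // Offset 0: the salt and separator take up 5 of the 8 hashed bytes.
        System.out.println(new String(bloomKey(row, 0, 8), StandardCharsets.US_ASCII)); // 0042:use

        // Offset 5: the 4-byte salt and ':' are skipped, so only id1 is hashed.
        System.out.println(new String(bloomKey(row, 5, 8), StandardCharsets.US_ASCII)); // user0001
    }
}
```

With an offset of 0, most of the hashed bytes come from the salt, which is the loss of effectiveness the "Problem" section describes. Skipping the salt lets all eight bytes come from id1.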
--
This message was sent by Atlassian Jira
(v8.20.10#820010)