JinHyuk Kim created HBASE-30174:
-----------------------------------

             Summary: Add start offset option to ROWPREFIX_FIXED_LENGTH bloom 
filter
                 Key: HBASE-30174
                 URL: https://issues.apache.org/jira/browse/HBASE-30174
             Project: HBase
          Issue Type: Task
          Components: master
            Reporter: JinHyuk Kim
            Assignee: JinHyuk Kim


h2. Problem

The {{ROWPREFIX_FIXED_LENGTH}} bloom filter always hashes the prefix starting 
from the beginning of the row key. This works well in many cases, but there are 
also schemas where the leading bytes contain low-value or repetitive data such 
as a fixed salt or bucket id.

For example, row keys like:
{code:java}
{salt}:{id1}:{id2}
{code}
may benefit more from building the bloom filter on {{id1}} rather than the 
leading salt bytes.

In those cases, hashing from offset 0 reduces the effectiveness of the bloom 
filter because part of the bloom key space is consumed by bytes that do not 
meaningfully help distinguish HFiles.
h2. Suggestion

Introduce a new optional configuration:
{code:java}
RowPrefixBloomFilter.prefix_start_offset
{code}
This allows the bloom filter to skip a configurable number of leading bytes 
before extracting the fixed-length prefix used for hashing. The default is 0, 
which keeps the current behavior of hashing from the start of the row key.

The goal is to support row key layouts where the meaningful lookup prefix does 
not start at byte {{0}}.
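
The extraction step described above can be sketched in plain Java. This is only 
an illustration under stated assumptions, not the proposed patch: the class 
{{PrefixOffsetSketch}} and the method {{bloomKey}} are hypothetical names, and 
clamping to the row length for short rows is an assumed policy that the actual 
change may handle differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not HBase code: shows which bytes of a row key would
// feed the bloom hash when a start offset is applied before the fixed-length
// prefix is taken.
public class PrefixOffsetSketch {

    // Returns the bytes [startOffset, startOffset + prefixLength) of the row.
    // Assumption: rows shorter than that range are clamped to the row length,
    // so a too-short row yields a shorter (possibly empty) key.
    static byte[] bloomKey(byte[] row, int startOffset, int prefixLength) {
        if (startOffset < 0 || prefixLength <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        int from = Math.min(startOffset, row.length);
        // Use long arithmetic so a very large prefixLength cannot overflow.
        int to = (int) Math.min((long) from + prefixLength, (long) row.length);
        return Arrays.copyOfRange(row, from, to);
    }

    public static void main(String[] args) {
        // Row key laid out as {salt}:{id1}:{id2} with a 4-byte salt.
        byte[] row = "0001:user0042:order9".getBytes(StandardCharsets.US_ASCII);

        // Offset 0: the salt and separator take up 5 of the 8 hashed bytes.
        System.out.println(new String(bloomKey(row, 0, 8), StandardCharsets.US_ASCII));
        // prints "0001:use"

        // Offset 5 skips the salt and separator, so all 8 bytes come from id1.
        System.out.println(new String(bloomKey(row, 5, 8), StandardCharsets.US_ASCII));
        // prints "user0042"
    }
}
```

In this sketch the offset has to cover the separator as well as the salt, which 
is why the example uses 5 rather than 4 for a 4-byte salt followed by {{:}}.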
h2. Usage
{code:java}
create 'test', {
  NAME => 'cf',
  BLOOMFILTER => 'ROWPREFIX_FIXED_LENGTH',
  CONFIGURATION => {
    'RowPrefixBloomFilter.prefix_length' => '8',
    'RowPrefixBloomFilter.prefix_start_offset' => '4'
  }
}
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
