[ 
https://issues.apache.org/jira/browse/HBASE-29974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-29974:
-----------------------------------
    Labels: pull-request-available  (was: )

> Filter seek hints underutilized due to early circuit breaks in scan pipeline, 
> causing unnecessary cell-level iteration
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29974
>                 URL: https://issues.apache.org/jira/browse/HBASE-29974
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Scanners
>    Affects Versions: 2.6.4, 2.5.13
>            Reporter: Shubham Roy
>            Assignee: Shubham Roy
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Summary
> The filter seek-hint infrastructure (SEEK_NEXT_USING_HINT / getNextCellHint) 
> is only reachable through one narrow path in the scan pipeline. Multiple 
> earlier circuit breaks — time range mismatch, column mismatch, version 
> exhaustion, and filterRowKey rejection — all short-circuit before the filter 
> is consulted, forcing the scanner to advance one cell at a time even when the 
> filter could provide a large forward jump.
> h1. Background
> HBase's filter API supports SEEK_NEXT_USING_HINT + getNextCellHint() to allow 
> a filter to tell the scanner "jump directly to this cell, skipping everything 
> in between." This is the most powerful skip primitive available. However, it 
> is only reachable via one path in matchColumn:
> {code:java}
> // All three must pass for filterCell to be reached:
> tr.compare(timestamp) == 0       // time range gate
> columns.checkColumn() == INCLUDE  // column gate
> columns.checkVersions() == INCLUDE* // version gate
> → filter.filterCell(cell)         // only here can SEEK_NEXT_USING_HINT be returned
> {code}
> Every other code path bypasses filterCell entirely.
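The gate ordering above can be pictured with a small self-contained model. This is only a sketch: `GateSketch`, its parameters, and the reduced `Code` enum are hypothetical stand-ins rather than HBase classes, and the structural gates are simplified to return a single code each. Only the order of the checks follows the description.

```java
// Hypothetical model of the gate order in matchColumn; not HBase code.
final class GateSketch {
    enum Code { SKIP, SEEK_NEXT_COL, INCLUDE, SEEK_NEXT_USING_HINT }

    // Counts how often the modelled filter is actually consulted.
    static int filterCalls = 0;

    // Stand-in for filter.filterCell(cell): this toy filter always wants to seek.
    static Code filterCell() {
        filterCalls++;
        return Code.SEEK_NEXT_USING_HINT;
    }

    // tsCmp: result of the time range comparison (0 means "in range").
    // columnOk / versionOk: whether the column and version gates would include the cell.
    static Code match(int tsCmp, boolean columnOk, boolean versionOk) {
        if (tsCmp > 0) return Code.SKIP;            // time range gate, filter bypassed
        if (tsCmp < 0) return Code.SEEK_NEXT_COL;   // time range gate, filter bypassed (simplified)
        if (!columnOk) return Code.SEEK_NEXT_COL;   // column gate, filter bypassed
        if (!versionOk) return Code.SEEK_NEXT_COL;  // version gate, filter bypassed (simplified)
        return filterCell();                        // the only path that reaches the filter
    }
}
```

In this model, `filterCalls` stays at zero for any cell that fails one of the three gates, which is the behaviour the issue describes.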
> h1. Problem
> h2. Problem 1 — Uninteresting rows (filterRowKey=true)
> When filterRowKey() returns true, the scanner calls nextRow(), which scans 
> forward one cell at a time via storeHeap.next(MOCKED_LIST). Inside this path, 
> matcher.match() is called per cell, but filterCell is only reached if a cell 
> passes the time range check. For rows with no cells in the scan's time range, 
> the time range gate fires for every cell, filterCell is never called, and the 
> filter's hint is unreachable. The scanner pays O(cells-in-row) cost per 
> rejected row rather than seeking directly to the next location.
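To make the cost difference concrete, here is a rough sketch that compares stepping over the cells of rejected rows one at a time with a single seek. `RejectedRowCost` and both of its methods are hypothetical. Rows are modelled as ints, and the seek is modelled as a binary search over a sorted array, which is an assumption for illustration and not how HBase locates a cell.

```java
// Hypothetical cost model; not HBase code.
final class RejectedRowCost {
    // rows[i] is the row id of the i-th cell; cells are sorted by row.
    // Returns how many cells are stepped over before reaching a cell with row >= targetRow.
    static int stepsCellByCell(int[] rows, int targetRow) {
        int steps = 0;
        int i = 0;
        while (i < rows.length && rows[i] < targetRow) {
            i++;
            steps++;
        }
        return steps;
    }

    // Models a seek as a binary search for the first cell with row >= targetRow.
    // Returns the number of probes made.
    static int probesWithSeek(int[] rows, int targetRow) {
        int lo = 0;
        int hi = rows.length;
        int probes = 0;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            if (rows[mid] < targetRow) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return probes;
    }
}
```

For a rejected row holding 1000 cells, the first method visits all 1000 of them, while the modelled seek needs a number of probes that grows only logarithmically with the number of cells.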
> h2. Problem 2 — Rows with cells outside the time range (filterRowKey=false)
> Even when a row is not rejected at the row key level, cells outside the time 
> range hit:
> {code:java}
> if (tsCmp > 0) { return MatchCode.SKIP; }                       // filter bypassed
> if (tsCmp < 0) { return columns.getNextRowOrNextColumn(cell); } // filter bypassed
> {code}
> The filter is never consulted. If the filter could determine a better skip 
> target for these cells, that capability is wasted.
> h2. Problem 3 — Cells failing column or version gates (filterRowKey=false, 
> cell in time range)
> Even for cells within the time range, two further gates can short-circuit 
> before filterCell:
> # checkColumn() ≠ INCLUDE → returns column-tracker hint (SEEK_NEXT_COL) 
> without consulting filter
> # checkVersions() = SKIP or SEEK_NEXT_COL → returns without consulting filter
> The column tracker can only suggest the next column or row. The filter may 
> know a much better target (e.g., skip several columns, or skip to a 
> completely different row), but is never asked.
> h1. Impact
> In all three cases, the scanner is forced into a cell-by-cell or row-by-row 
> iteration that it could avoid if the filter's hint were consulted. Filters 
> with efficient seeking logic (e.g., FuzzyRowFilter, ColumnRangeFilter, custom 
> range filters) incur unnecessary I/O proportional to the number of skipped 
> cells/rows.
> h1. Root Cause
> The filter hint mechanism and the scan pipeline's short-circuit mechanism are 
> disconnected. Short-circuits exist for correctness and efficiency reasons 
> (time range, column set, version limits), but they each bypass the filter as 
> a side effect. The filter has no opportunity to provide a hint unless a cell 
> passes every prior gate.
> h1. Solution
> Two new purpose-built API methods are introduced on Filter (with concrete 
> default implementations returning null for full backward compatibility):
> Filter.getHintForRejectedRow(Cell firstRowCell)
> Addresses Problem 1. Called in RegionScannerImpl immediately after 
> filterRowKey() returns true, instead of calling filterCell(). Gives the 
> filter an opportunity to provide a seek target to bypass row-by-row scanning.
> Contract:
> * Only called after filterRowKey returns true for the same cell
> * May use state derived from filterRowKey (e.g., current range pointer in 
> MultiRowRangeFilter)
> * Must not invoke filterCell logic — callers guarantee filterCell has not 
> been called for this row
> * Default returns null (falls through to existing nextRow() behavior)
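The contract above can be illustrated with a simplified range filter. This is a sketch under stated assumptions: `RangeFilterSketch` is hypothetical, row keys are ints instead of Cells, a hint is simply the row to seek to, and rows are assumed to arrive in ascending scan order. It is not the MultiRowRangeFilter implementation.

```java
// Hypothetical model of a filter that reuses filterRowKey state for a rejected-row hint.
// Not HBase code: the proposed method takes and returns Cell objects.
final class RangeFilterSketch {
    private final int[][] ranges;  // sorted, non-overlapping [start, end) row ranges
    private int nextRange = 0;     // range pointer maintained by filterRowKey

    RangeFilterSketch(int[][] ranges) {
        this.ranges = ranges;
    }

    // Returns true when the row should be rejected (same polarity as filterRowKey).
    boolean filterRowKey(int row) {
        // Advance past ranges that end at or before this row.
        while (nextRange < ranges.length && ranges[nextRange][1] <= row) {
            nextRange++;
        }
        if (nextRange == ranges.length) {
            return true;                      // past the last range
        }
        return row < ranges[nextRange][0];    // in a gap before the next range
    }

    // Intended to be called only after filterRowKey(row) returned true.
    // Returns the row to seek to, or null for "no hint" (fall back to nextRow()).
    Integer getHintForRejectedRow(int row) {
        if (nextRange == ranges.length) {
            return null;                      // nothing left to seek to
        }
        return ranges[nextRange][0];          // jump to the start of the next range
    }
}
```

With ranges [10, 20) and [50, 60), a rejected row 25 yields a hint of 50, so the scanner could skip rows 26 through 49 in one seek. A row past the last range yields null.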
> Filter.getSkipHint(Cell skippedCell)
> Addresses Problems 2 and 3. Called at every structural short-circuit in matchColumn 
> before filterCell is reached. Gives the filter an opportunity to provide a 
> seek target for cells skipped by the time range, column, or version gate.
> Contract:
> * May be called for cells that have not been passed through filterCell
> * Must not modify filter state (completely stateless)
> * Only filters with immutable, configuration-based hint computation should 
> override this
> * Default returns null (falls through to existing skip/seek behavior)
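A stateless hint of the kind the getSkipHint contract asks for might look like the sketch below. Everything here is hypothetical: `ColumnRangeSketch` models qualifiers as Strings, the hint is the qualifier to seek to, and `resolve` is an assumed picture of how a short-circuit site could prefer a filter hint over the existing structural result. The issue text does not specify that merging logic.

```java
// Hypothetical model of a stateless skip hint; not HBase code.
final class ColumnRangeSketch {
    private final String minColumn;  // immutable configuration: inclusive lower bound

    ColumnRangeSketch(String minColumn) {
        this.minColumn = minColumn;
    }

    // Stateless: the result depends only on configuration and the argument,
    // and no field is modified.
    String getSkipHint(String skippedColumn) {
        return skippedColumn.compareTo(minColumn) < 0 ? minColumn : null;
    }

    // Assumed call-site behaviour: use the hint when present,
    // otherwise keep the existing structural result.
    static String resolve(String structuralResult, String hint) {
        return hint != null ? "SEEK_NEXT_USING_HINT:" + hint : structuralResult;
    }
}
```

Because the method reads only the final field, calling it for cells that never reached filterCell cannot disturb any per-row state the filter keeps elsewhere.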



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
