[
https://issues.apache.org/jira/browse/HBASE-29974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-29974:
-----------------------------------
Labels: pull-request-available (was: )
> Filter seek hints underutilized due to early circuit breaks in scan pipeline,
> causing unnecessary cell-level iteration
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29974
> URL: https://issues.apache.org/jira/browse/HBASE-29974
> Project: HBase
> Issue Type: Improvement
> Components: Filters, Scanners
> Affects Versions: 2.6.4, 2.5.13
> Reporter: Shubham Roy
> Assignee: Shubham Roy
> Priority: Major
> Labels: pull-request-available
>
> h1. Summary
> The filter seek-hint infrastructure (SEEK_NEXT_USING_HINT / getNextCellHint)
> is only reachable through one narrow path in the scan pipeline. Multiple
> earlier circuit breaks — time range mismatch, column mismatch, version
> exhaustion, and filterRowKey rejection — all short-circuit before the filter
> is consulted, forcing the scanner to advance one cell at a time even when the
> filter could provide a large forward jump.
> h1. Background
> HBase's filter API supports SEEK_NEXT_USING_HINT + getNextCellHint() to allow
> a filter to tell the scanner "jump directly to this cell, skipping everything
> in between." This is the most powerful skip primitive available. However, it
> is only reachable via one path in matchColumn:
> {code:java}
> // All three must pass for filterCell to be reached:
> tr.compare(timestamp) == 0 // time range gate
> columns.checkColumn() == INCLUDE // column gate
> columns.checkVersions() == INCLUDE* // version gate
> → filter.filterCell(cell) // only here can SEEK_NEXT_USING_HINT be returned
> {code}
> Every other code path bypasses filterCell entirely.
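The gate ordering described above can be sketched as a small, self-contained model. This is not the HBase matcher code: the class, enum, and parameter names are hypothetical, and the half-open time range is an assumption made only for this sketch. It shows that the filter is consulted only when all three gates pass.

```java
// Illustrative model only: hypothetical names, not the HBase ScanQueryMatcher API.
final class GateOrderSketch {
    enum Outcome {
        TIME_RANGE_SHORT_CIRCUIT,
        COLUMN_SHORT_CIRCUIT,
        VERSION_SHORT_CIRCUIT,
        FILTER_CONSULTED
    }

    /**
     * Applies the three gates in the order listed in the issue text.
     * The time range is treated as [minTs, maxTs), an assumption of this sketch.
     */
    static Outcome matchColumn(long ts, long minTs, long maxTs,
                               boolean columnRequested,
                               int versionsSeen, int maxVersions) {
        if (ts < minTs || ts >= maxTs) {
            return Outcome.TIME_RANGE_SHORT_CIRCUIT; // filter never sees the cell
        }
        if (!columnRequested) {
            return Outcome.COLUMN_SHORT_CIRCUIT;     // filter never sees the cell
        }
        if (versionsSeen >= maxVersions) {
            return Outcome.VERSION_SHORT_CIRCUIT;    // filter never sees the cell
        }
        // Only on this path could a filter return a seek hint.
        return Outcome.FILTER_CONSULTED;
    }
}
```

Failing any one gate is enough to keep the filter out of the decision, which is the behavior the rest of the issue addresses.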
> h1. Problem
> h2. Problem 1 — Uninteresting rows (filterRowKey=true)
> When filterRowKey() returns true, the scanner calls nextRow(), which scans
> forward one cell at a time via storeHeap.next(MOCKED_LIST). Inside this path,
> matcher.match() is called per cell, but filterCell is only reached if a cell
> passes the time range check. For rows with no cells in the scan's time range,
> the time range gate fires for every cell, filterCell is never called, and the
> filter's hint is unreachable. The scanner pays O(cells-in-row) cost per
> rejected row rather than seeking directly to the next location.
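To make the cost difference concrete, here is a rough sketch that compares stepping cell by cell with a single seek over a sorted array of row keys. It is an analogy only: a real HBase seek goes through the store scanners and block index rather than a binary search over an in-memory array, and all names here are hypothetical.

```java
// Illustrative analogy only: a sorted array of row keys stands in for the store.
final class RejectedRowCostSketch {

    /** Cells touched when advancing one cell at a time until row >= targetRow. */
    static int cellsVisitedStepping(String[] cellRows, int from, String targetRow) {
        int visited = 0;
        for (int i = from; i < cellRows.length
                && cellRows[i].compareTo(targetRow) < 0; i++) {
            visited++;
        }
        return visited;
    }

    /** Probes made by a binary search for the first cell with row >= targetRow. */
    static int probesWhenSeeking(String[] cellRows, int from, String targetRow) {
        int lo = from;
        int hi = cellRows.length;
        int probes = 0;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            if (cellRows[mid].compareTo(targetRow) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return probes;
    }
}
```

With many cells in the rejected rows, the stepping cost grows linearly with the number of cells skipped, while the seek cost in this model grows only logarithmically.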
> h2. Problem 2 — Rows with cells outside the time range (filterRowKey=false)
> Even when a row is not rejected at the row key level, cells outside the time
> range hit:
> {code:java}
> if (tsCmp > 0) { return MatchCode.SKIP; } // filter bypassed
> if (tsCmp < 0) { return columns.getNextRowOrNextColumn(cell); } // filter bypassed
> {code}
> The filter is never consulted. If the filter could determine a better skip
> target for these cells, that capability is wasted.
> h2. Problem 3 — Cells failing column or version gates (filterRowKey=false,
> cell in time range)
> Even for cells within the time range, two further gates can short-circuit
> before filterCell:
> # checkColumn() ≠ INCLUDE → returns column-tracker hint (SEEK_NEXT_COL)
> without consulting filter
> # checkVersions() = SKIP or SEEK_NEXT_COL → returns without consulting filter
> The column tracker can only suggest the next column or row. The filter may
> know a much better target (e.g., skip several columns, or skip to a
> completely different row), but is never asked.
> h1. Impact
> In all three cases, the scanner is forced into a cell-by-cell or row-by-row
> iteration that it could avoid if the filter's hint were consulted. Filters
> with efficient seeking logic (e.g., FuzzyRowFilter, ColumnRangeFilter, custom
> range filters) incur unnecessary I/O proportional to the number of skipped
> cells/rows.
> h1. Root Cause
> The filter hint mechanism and the scan pipeline's short-circuit mechanism are
> disconnected. Short-circuits exist for correctness and efficiency reasons
> (time range, column set, version limits), but they each bypass the filter as
> a side effect. The filter has no opportunity to provide a hint unless a cell
> passes every prior gate.
> h1. Solution
> Two new purpose-built API methods are introduced on Filter (with concrete
> default implementations returning null for full backward compatibility):
> Filter.getHintForRejectedRow(Cell firstRowCell)
> Addresses Problem 1. Called in RegionScannerImpl immediately after
> filterRowKey() returns true, instead of calling filterCell(). Gives the
> filter an opportunity to provide a seek target to bypass row-by-row scanning.
> Contract:
> * Only called after filterRowKey returns true for the same cell
> * May use state derived from filterRowKey (e.g., current range pointer in
> MultiRowRangeFilter)
> * Must not invoke filterCell logic — callers guarantee filterCell has not
> been called for this row
> * Default returns null (falls through to existing nextRow() behavior)
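The contract above might look roughly like the following sketch. It uses hypothetical stand-in types with String row keys instead of the real Filter and Cell classes, and the range filter is a simplified illustration rather than MultiRowRangeFilter itself. The default method returns null, and the override relies only on the range pointer left behind by filterRowKey.

```java
// Hypothetical stand-in for the Filter API; String row keys replace Cell.
interface RowHintFilterSketch {
    /** Returns true when the row should be rejected. */
    boolean filterRowKey(String row);

    /** Default: no hint, so the caller keeps its existing nextRow() behavior. */
    default String getHintForRejectedRow(String row) {
        return null;
    }
}

// Simplified range filter; assumes rows arrive in ascending order.
final class SortedRangesSketch implements RowHintFilterSketch {
    private final String[][] ranges; // sorted, non-overlapping [start, stop) pairs
    private int current = 0;         // range pointer maintained by filterRowKey

    SortedRangesSketch(String[][] ranges) {
        this.ranges = ranges;
    }

    @Override
    public boolean filterRowKey(String row) {
        // Move past every range that ends at or before this row.
        while (current < ranges.length && ranges[current][1].compareTo(row) <= 0) {
            current++;
        }
        if (current == ranges.length) {
            return true; // beyond the last range
        }
        return row.compareTo(ranges[current][0]) < 0; // before the current range
    }

    @Override
    public String getHintForRejectedRow(String row) {
        // Uses only state derived from filterRowKey; no filterCell logic involved.
        return current < ranges.length ? ranges[current][0] : null;
    }
}
```

A caller in this model would seek to the returned row when the hint is non-null and otherwise fall back to advancing row by row.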
> Filter.getSkipHint(Cell skippedCell)
> Addresses Problems 2 and 3. Called at every structural short-circuit in
> matchColumn before filterCell is reached. Gives the filter an opportunity to
> provide a seek target for cells skipped by the time range, column, or version
> gate.
> Contract:
> * May be called for cells that have not been passed through filterCell
> * Must not modify filter state (completely stateless)
> * Only filters with immutable, configuration-based hint computation should
> override this
> * Default returns null (falls through to existing skip/seek behavior)
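A stateless hint of the kind this contract calls for could be sketched as follows. The types are hypothetical stand-ins that use String qualifiers in place of Cell, and the caller helper is an assumption about how a matcher might fall back when no hint is given. The hint is computed purely from immutable configuration.

```java
// Hypothetical stand-in for the Filter API; String qualifiers replace Cell.
interface SkipHintFilterSketch {
    /** Default: no hint, so the caller keeps its existing skip/seek result. */
    default String getSkipHint(String skippedQualifier) {
        return null;
    }
}

// Hint depends only on a final field, so repeated calls cannot change state.
final class QualifierRangeSketch implements SkipHintFilterSketch {
    private final String minQualifier; // inclusive lower bound, fixed at construction

    QualifierRangeSketch(String minQualifier) {
        this.minQualifier = minQualifier;
    }

    @Override
    public String getSkipHint(String skippedQualifier) {
        // Below the configured range: suggest jumping to its start.
        return skippedQualifier.compareTo(minQualifier) < 0 ? minQualifier : null;
    }
}

final class SkipHintCallerSketch {
    /** Prefers the filter's hint; otherwise keeps the structural target. */
    static String seekTarget(SkipHintFilterSketch filter,
                             String skippedQualifier,
                             String structuralTarget) {
        String hint = filter.getSkipHint(skippedQualifier);
        return hint != null ? hint : structuralTarget;
    }
}
```

Because getSkipHint reads only a final field, it is safe to call for cells that never passed through filterCell, which is the property the contract requires.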