sahuagin opened a new pull request, #9788:
URL: https://github.com/apache/arrow-rs/pull/9788
Closes #9785
Adds `scan_filtered(num_values, out, predicate)` as a provided method on the
`Decoder` trait. The method scans up to `num_values` values, appending to `out`
only values from regions where `predicate(lo, hi)` returns `true`.
**Default implementation** (all encodings): ignores the predicate, decodes
everything. Safe fallback — no behavioral change for existing decoders.
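A minimal sketch of how a provided method with a predicate-ignoring default could look. The trait `ToyDecoder`, the struct `ToyPlain`, the `i64` value type, and the exact signature are all hypothetical simplifications for illustration; the real `Decoder` trait is generic and returns `Result`.

```rust
/// Hypothetical, simplified stand-in for the real `Decoder` trait.
trait ToyDecoder {
    /// Decode up to `buf.len()` values; returns how many were written.
    fn get(&mut self, buf: &mut [i64]) -> usize;

    /// Provided method: the default ignores the predicate and decodes
    /// everything, so existing decoders keep their behavior.
    fn scan_filtered<P>(&mut self, num_values: usize, out: &mut Vec<i64>, _predicate: P) -> usize
    where
        P: FnMut(i64, i64) -> bool,
    {
        let start = out.len();
        out.resize(start + num_values, 0);
        let n = self.get(&mut out[start..]);
        out.truncate(start + n); // drop the unused tail
        n
    }
}

/// Hypothetical PLAIN-like decoder with no per-region metadata.
struct ToyPlain {
    values: Vec<i64>,
    pos: usize,
}

impl ToyDecoder for ToyPlain {
    fn get(&mut self, buf: &mut [i64]) -> usize {
        let n = buf.len().min(self.values.len() - self.pos);
        buf[..n].copy_from_slice(&self.values[self.pos..self.pos + n]);
        self.pos += n;
        n
    }
}
```

In this sketch, `ToyPlain` never overrides `scan_filtered`, so even a predicate that always returns `false` still yields every value.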
**`DeltaBitPackDecoder` override:** Computes a conservative `[lo, hi]` range
per miniblock from `last_value`, `min_delta`, `bit_width`, and the miniblock
value count. If the predicate rejects the range, the miniblock is skipped,
where possible without decoding individual values. There are three skip
strategies, depending on context (`bw` is the miniblock `bit_width`):
- `bw=0`: arithmetic advancement of `last_value`, no bit reads.
- Terminal `bw>0`: `BitReader::skip`, no decode.
- Mid-stream `bw>0`: decode into a scratch buffer to keep `last_value`
  accurate for subsequent miniblock range checks.
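One way to derive such a bound, sketched under assumptions: each packed delta lies in `[min_delta, min_delta + 2^bit_width - 1]`, so the value after `k` steps lies between `last_value + k * min_delta` and `last_value + k * max_delta`. The function names, the `i128` widening, and the fall-back to the full `i64` range on overflow are illustrative choices, not taken from the PR.

```rust
/// Conservative `[lo, hi]` bound for the next `n` values of a miniblock
/// (assumes `n >= 1`). Hypothetical helper, not the PR's actual code.
fn miniblock_range(last_value: i64, min_delta: i64, bit_width: u32, n: u32) -> (i64, i64) {
    const FULL: (i64, i64) = (i64::MIN, i64::MAX);
    if bit_width >= 64 {
        return FULL; // deltas can be anything; nothing can be ruled out
    }
    // Widen to i128 so the intermediate products cannot overflow.
    let last = last_value as i128;
    let dmin = min_delta as i128;
    let dmax = dmin + ((1i128 << bit_width) - 1);
    let n = n as i128;
    // The value after k steps (1 <= k <= n) is in [last + k*dmin, last + k*dmax].
    let lo = last + if dmin >= 0 { dmin } else { n * dmin };
    let hi = last + if dmax >= 0 { n * dmax } else { dmax };
    match (i64::try_from(lo), i64::try_from(hi)) {
        (Ok(lo), Ok(hi)) => (lo, hi),
        // The bound leaves i64, so values may wrap: accept everything.
        _ => FULL,
    }
}

/// `bit_width == 0` skip: every delta equals `min_delta`, so `last_value`
/// can be advanced arithmetically. Wrapping arithmetic is an assumption.
fn skip_constant_miniblock(last_value: i64, min_delta: i64, n: u32) -> i64 {
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}
```

For a monotone column with `min_delta = 1` and `bit_width = 0`, a 32-value miniblock starting after `last_value = 100` is bounded by `[101, 132]`, and skipping it leaves `last_value` at 132 without reading any bits.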
The predicate contract is conservative: `false` means the region definitely
cannot match (safe to skip); `true` means it might match (decode and emit).
False positives are safe. False negatives are not permitted: a predicate must
never return `false` for a region that could contain a matching value.
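As an illustration of this contract, the sketch below builds range predicates for two common filters. The helper names `greater_than` and `in_range` are hypothetical; each returns `false` only when no value in `[lo, hi]` can satisfy the filter.

```rust
/// Predicate for the filter `value > threshold`: a region can be skipped
/// only if even its upper bound fails the filter.
fn greater_than(threshold: i64) -> impl Fn(i64, i64) -> bool {
    move |_lo, hi| hi > threshold
}

/// Predicate for the filter `a <= value <= b`: a region may match
/// whenever `[lo, hi]` overlaps `[a, b]`.
fn in_range(a: i64, b: i64) -> impl Fn(i64, i64) -> bool {
    move |lo, hi| lo <= b && hi >= a
}
```

Because `true` only means "might match", an accepted region can still contain values that fail the filter, so a caller would presumably apply the exact per-value filter to whatever lands in `out`.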
**Benchmarks:**
```
scan_filtered on 1M-row monotone DELTA column: 1.96ms → 470µs (4.2x)
```
**Tests added:**
- Default implementation (PLAIN): predicate ignored, all values emitted.
- Delta reject-all: nothing emitted, all values consumed.
- Delta accept-all: all values emitted (identical to `get()`).
- Delta conservative overlap: miniblock accepted when its range overlaps the
  threshold.
- Delta bw=0 reject/accept: constant column skipped or emitted in O(1).
**Origin:** Range predicate pushdown into DELTA-encoded columns. The format
already carries the information needed to skip miniblocks without decoding:
the per-miniblock `min_delta` and `bit_width` headers bound the value range.
`scan_filtered` surfaces that capability through the `Decoder` trait so
callers can use it without knowing the encoding.
**Note on API surface:** `scan_filtered` is a provided method with a safe
default, so adding it is non-breaking. Encodings that don't have per-region
metadata (PLAIN, RLE, etc.) get the correct conservative behavior for free.
Generated-by: Claude (claude-sonnet-4-6)