sahuagin opened a new pull request, #9788:
URL: https://github.com/apache/arrow-rs/pull/9788

   Closes #9785
   
   Adds `scan_filtered(num_values, out, predicate)` as a provided method on the
   `Decoder` trait. The method scans up to `num_values` values, appending to `out`
   only the values from regions where `predicate(lo, hi)` returns `true`.
   
   **Default implementation** (all encodings): ignores the predicate and decodes
   everything. This is a safe fallback with no behavioral change for existing
   decoders.
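
   To illustrate the shape of a provided method with a conservative default,
   here is a minimal sketch. It is not the code from this PR: `SketchDecoder`
   and `PlainSketch` are hypothetical stand-ins for the real `Decoder` trait,
   and the `i64` value type, the `&mut dyn FnMut(i64, i64) -> bool` predicate
   type, and the `usize` return value (values consumed) are all assumptions.

   ```rust
   /// Hypothetical, simplified stand-in for the parquet `Decoder` trait.
   pub trait SketchDecoder {
       /// Decode up to `buf.len()` values into `buf`; returns how many were written.
       fn get(&mut self, buf: &mut [i64]) -> usize;

       /// Provided method: the default ignores the predicate and emits everything,
       /// which is always correct because false positives are allowed.
       /// Returns the number of values consumed (an assumed convention).
       fn scan_filtered(
           &mut self,
           num_values: usize,
           out: &mut Vec<i64>,
           _predicate: &mut dyn FnMut(i64, i64) -> bool,
       ) -> usize {
           let start = out.len();
           // Reserve room, decode into the tail, then trim to what was read.
           out.resize(start + num_values, 0);
           let read = self.get(&mut out[start..]);
           out.truncate(start + read);
           read
       }
   }

   /// Toy PLAIN-like decoder over an in-memory vector (illustration only).
   pub struct PlainSketch {
       pub values: Vec<i64>,
       pub pos: usize,
   }

   impl SketchDecoder for PlainSketch {
       fn get(&mut self, buf: &mut [i64]) -> usize {
           let n = buf.len().min(self.values.len() - self.pos);
           buf[..n].copy_from_slice(&self.values[self.pos..self.pos + n]);
           self.pos += n;
           n
       }
   }
   ```

   Because `PlainSketch` does not override `scan_filtered`, even a predicate
   that rejects every region still yields all values, which mirrors the
   "predicate ignored" behavior described above.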
   
   **`DeltaBitPackDecoder` override:** Computes a conservative `[lo, hi]` range per
   miniblock from `last_value`, `min_delta`, `bit_width`, and the miniblock value
   count. If the predicate rejects the range, the miniblock is skipped without
   emitting any values. How a rejected miniblock is handled depends on context:
   
   - `bw=0`: arithmetic advancement of `last_value`, no bit reads.
   - Terminal `bw>0`: `BitReader::skip`, no decode.
   - Mid-stream `bw>0`: decode into scratch buffer to maintain `last_value`
     accuracy for subsequent miniblock range checks.
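
   One way such a conservative range could be derived is sketched below. This
   is an assumption about the arithmetic, not the formula used in the PR:
   `miniblock_bounds` and `skip_constant_miniblock` are hypothetical helpers,
   and the bounds ignore the wrapping overflow that real DELTA data may rely
   on. Each packed delta lies in `[min_delta, min_delta + 2^bit_width - 1]`,
   so after `k` steps the value lies between `last_value + k * d_min` and
   `last_value + k * d_max`.

   ```rust
   /// Conservative `[lo, hi]` bounds for the `n` values of one miniblock.
   /// Hypothetical helper; assumes no wrapping overflow in the real values.
   pub fn miniblock_bounds(last_value: i64, min_delta: i64, bit_width: u32, n: usize) -> (i64, i64) {
       assert!(n > 0 && bit_width <= 64);
       let last = last_value as i128;
       let n = n as i128;
       let d_min = min_delta as i128;
       // Largest possible delta: min_delta plus the largest packed offset.
       let d_max = d_min + ((1i128 << bit_width) - 1);
       // The lowest value is reached after 1 step if deltas are non-negative,
       // otherwise after all n steps; the highest value is the mirror case.
       let lo_off = if d_min >= 0 { d_min } else { n.saturating_mul(d_min) };
       let hi_off = if d_max >= 0 { n.saturating_mul(d_max) } else { d_max };
       (clamp(last.saturating_add(lo_off)), clamp(last.saturating_add(hi_off)))
   }

   /// Clamp a wide intermediate result back into the `i64` range.
   fn clamp(v: i128) -> i64 {
       v.clamp(i64::MIN as i128, i64::MAX as i128) as i64
   }

   /// `bw=0` case: every delta equals `min_delta`, so `last_value` can be
   /// advanced arithmetically without reading any packed bits.
   pub fn skip_constant_miniblock(last_value: i64, min_delta: i64, n: usize) -> i64 {
       last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
   }
   ```

   For example, with `last_value = 100`, `min_delta = -2`, `bit_width = 2`, and
   4 values, the deltas lie in `[-2, 1]`, so the sketch returns `(92, 104)`.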
   
   The predicate contract is conservative: `false` means the region definitely
   cannot match (safe to skip); `true` means it might match (decode and emit).
   False positives are safe. A predicate must never return a false negative.
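
   As an illustration of a predicate that satisfies this contract, the sketch
   below builds one for a closed query interval. `range_predicate`, `q_lo`, and
   `q_hi` are hypothetical names, and `i64` bounds are an assumption. The
   predicate returns `true` whenever the region `[lo, hi]` overlaps the query
   interval, so it can over-accept but never rejects a region that contains a
   match.

   ```rust
   /// Build a conservative predicate for the query `q_lo <= x <= q_hi`.
   /// Hypothetical helper: returns `true` if `[lo, hi]` overlaps the query.
   pub fn range_predicate(q_lo: i64, q_hi: i64) -> impl Fn(i64, i64) -> bool {
       // Two closed intervals overlap iff each one starts before the other ends.
       move |lo, hi| lo <= q_hi && hi >= q_lo
   }
   ```

   Since an accepted region can still contain values outside the query
   interval, a caller would typically apply the exact filter to the emitted
   values afterwards.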
   
   **Benchmarks:**
   ```
   scan_filtered on 1M-row monotone DELTA column: 1.96ms → 470µs (4.2x)
   ```
   
   **Tests added:**
   - Default implementation (PLAIN): predicate ignored, all values emitted.
   - Delta reject-all: nothing emitted, all values consumed.
   - Delta accept-all: all values emitted (identical to `get()`).
   - Delta conservative overlap: miniblock accepted when range overlaps threshold.
   - Delta bw=0 reject/accept: constant column skipped or emitted O(1).
   
   **Origin:** Range predicate pushdown into DELTA-encoded columns. The format
   already carries the information needed to skip miniblocks without decoding:
   the per-miniblock `min_delta` and `bit_width` headers bound the value range.
   `scan_filtered` surfaces that capability through the `Decoder` trait so
   callers can use it without knowing the encoding.
   
   **Note on API surface:** `scan_filtered` is a provided method with a safe
   default, so adding it is non-breaking. Encodings that don't have per-region
   metadata (PLAIN, RLE, etc.) get the correct conservative behavior for free.
   
   Generated-by: Claude (claude-sonnet-4-6)

