[I] Further improve performance of IN list evaluation [datafusion]

via GitHub Tue, 09 Dec 2025 08:53:42 -0800


geoffreyclaude opened a new issue, #19241:
URL: https://github.com/apache/datafusion/issues/19241


   ## Related Issues
   
   - Follow-up to #18824 (Restore IN_LIST performance -- Implement specialized 
`StaticFilters` for different data types)
   - POC implementation: https://github.com/pydantic/datafusion/pull/48
   
   ## Motivation
   
   IN list filters have become **critical-path operations** for dynamic filter 
pushdown. When scanning large tables with partition pruning or dynamic filters, 
the IN expression is evaluated millions of times per query. The current generic 
implementation leaves significant performance on the table.
   
   The POC demonstrates **30-78% speedups** on primitive types and **up to 43% 
speedups** on string types by exploiting type-specific optimizations that the 
compiler cannot infer from generic code.
   
   ## High-Level Optimization Strategy
   
   ### 1. Const-Generic Branchless Evaluation
   
   For small IN lists (≤16 elements), use compile-time-known array sizes (`[T; 
N]` instead of `Vec<T>`). This enables:
   - Loop unrolling and SIMD vectorization
   - Branch elimination (no conditional jumps)
   - Register-resident data (no heap access)
   
   ### 2. Type Normalization
   
   For equality comparison, only bit patterns matter. Normalize types to reduce 
code paths:
   - Signed integers → Unsigned (`Int32` → `UInt32`)
   - Floats → Unsigned (`Float32` → `UInt32`)
   - Short `Utf8View` strings (≤12 bytes) → 128-bit integers
   
   This is zero-cost at runtime (buffer pointer cast, no data copy).
   
   ### 3. Tiered Lookup Strategy
   
   Select the optimal algorithm based on list size:
   | List Size | Strategy | Rationale |
   |-----------|----------|-----------|
   | 1-16 | Branchless OR-chain | Parallel comparison in registers |
   | 17-32 | Binary search | O(log n) with good cache locality |
   | >32 | Hash set | O(1) amortized |
   
   ### 4. Short-String Fast Path
   
   `Utf8View` stores strings ≤12 bytes inline. Reinterpret the 16-byte view 
struct as a 128-bit integer to turn string comparison into a single integer 
comparison.
   
   ## Benchmark Highlights
   
   Most impactful improvements (POC vs. 
[#18832](https://github.com/apache/datafusion/pull/18832)):
   
   | Benchmark | Speedup |
   |-----------|---------|
   | `Float32/list=3/nulls=0%` | **-78%** (2.20 µs → 485 ns) |
   | `Float32/list=8/nulls=0%` | **-77%** (2.94 µs → 677 ns) |
   | `Int32/list=8/nulls=0%` | **-63%** (1.88 µs → 688 ns) |
   | `Int32/list=3/nulls=0%` | **-62%** (1.41 µs → 531 ns) |
   | `Utf8View/list=3/str=3` | **-43%** (2.18 µs → 1.25 µs) |
   | `Utf8View/list=3/str=12` | **-42%** (2.19 µs → 1.27 µs) |
   | `Utf8/list=100/str=3` | **-30%** (8.02 µs → 5.62 µs) |
   
   ## Proposed Implementation Plan
   
   To make this reviewable, I propose splitting into multiple PRs:
   
   ### PR 1: Const-Generic Branchless Filter for Primitives
   - [ ] Add `BranchlessFilter<T, N>` with compile-time known sizes (1-16)
   - [ ] Add `FilterStrategy` enum and tiered `select_strategy` 
(branchless/binary/hash)
   - [ ] Add unified `PrimitiveFilter<T, S>` with `LookupStrategy` trait
   - [ ] Cover unsigned types: `UInt8/16/32/64`
   
   ### PR 2: Type Normalization
   - [ ] Add `TransformingFilter` wrapper for zero-cost type reinterpretation
   - [ ] Signed → Unsigned: `Int8/16/32/64` → `UInt8/16/32/64`
   - [ ] Float → Unsigned: `Float32` → `UInt32`, `Float64` → `UInt64`
   
   ### PR 3: Utf8View Short-String Optimization  
   - [ ] Implement short-string (≤12 bytes) reinterpretation as 
`i128`/`Decimal128`
   - [ ] Integrate with branchless filter for small lists
   - [ ] Fallback to hash for long strings or large lists
   
   ### PR 4: Remaining Types (if needed)
   - [ ] Boolean
   - [ ] Decimal128/256
   - [ ] Binary/LargeBinary/BinaryView
   
   ## Open Questions
   
   1. **Threshold tuning**: The cutoffs (16 for branchless, 32 for binary) are 
based on microbenchmarks. Should we tune for specific workloads?
   2. **Code size**: Const-generic instantiation creates multiple monomorphized 
versions. Is the binary size increase acceptable?
   3. **Dictionary arrays**: Current POC unpacks dictionaries. Should we 
optimize the dictionary case separately?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Further improve performance of IN list evaluation [datafusion]

Reply via email to