Re: [PR] arrow-select: fuse inline Utf8View/BinaryView filter coalescing [arrow-rs]

via GitHub Tue, 02 Jun 2026 23:13:03 -0700


ClSlaid commented on PR #9755:
URL: https://github.com/apache/arrow-rs/pull/9755#issuecomment-4609603062


   Benchmark update for the latest version of this PR.
   
   I reran the full `filter:` benchmark set after reverting the dense-filter 
experiments, so the `current` numbers below correspond to the current 
per-column implementation.
   
   Command:
   
   ```bash
   CARGO_TARGET_DIR=/Users/cl/Projects/CLionProjects/arrow-rs/target \
   cargo bench -p arrow --features test_utils --bench coalesce_kernels -- 'filter:' \
     --sample-size 10 --warm-up-time 1 --measurement-time 2 \
     --baseline pre_per_column_filter_all
   ```
   
   Definitions:
   
   - **baseline**: `apache/main` (`apache_main_filter_all`)
   - **previous**: previous PR implementation before the per-column refactor (`pre_per_column_filter_all`, the saved Criterion baseline named by `--baseline` in the command above)
   - **current**: current per-column implementation
   - Relative time is normalized to baseline, so lower is better.
   - `Current vs previous` is the relative change in the geomean of per-case mean times, so negative means faster.
   - The improved/regressed/no-change counts use Criterion's confidence 
interval for current vs previous.
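
For readers who want to reproduce the aggregation, here is a minimal sketch of how a geomean-based relative time and percent change can be computed. The function names (`geomean`, `relative_time`, `percent_change`) and the sample timings are hypothetical and are not taken from the PR or from Criterion; the actual numbers in this comment came from Criterion's output.

```rust
/// Geometric mean of strictly positive, finite values.
/// Returns None for an empty slice or any invalid entry.
fn geomean(values: &[f64]) -> Option<f64> {
    if values.is_empty() || values.iter().any(|v| !v.is_finite() || *v <= 0.0) {
        return None;
    }
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    Some((log_sum / values.len() as f64).exp())
}

/// Relative time of `times` against `reference`, aggregated as the geomean
/// of the per-case ratios (mathematically equal to the ratio of the geomeans).
fn relative_time(times: &[f64], reference: &[f64]) -> Option<f64> {
    if times.len() != reference.len() {
        return None;
    }
    let ratios: Vec<f64> = times.iter().zip(reference).map(|(t, r)| t / r).collect();
    geomean(&ratios)
}

/// Percent change implied by a relative time: below 1.0 gives a negative value.
fn percent_change(relative: f64) -> f64 {
    (relative - 1.0) * 100.0
}

fn main() {
    // Hypothetical per-case mean times in microseconds (not measured data).
    let baseline = [100.0, 200.0, 400.0];
    let previous = [90.0, 190.0, 380.0];
    let current = [80.0, 150.0, 390.0];

    let prev_rel = relative_time(&previous, &baseline).unwrap();
    let cur_rel = relative_time(&current, &baseline).unwrap();
    let cur_vs_prev = relative_time(&current, &previous).unwrap();

    println!("previous: {:.3}x of baseline", prev_rel);
    println!("current:  {:.3}x of baseline", cur_rel);
    println!("current vs previous: {:+.1}%", percent_change(cur_vs_prev));
}
```

Because the geomean of ratios equals the ratio of geomeans, dividing the `Current` column by the `Previous` column should reproduce the `Current vs previous` column up to rounding of the displayed values.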
   
   | Group | Cases | Baseline | Previous | Current | Current vs previous | Current vs previous significance |
   |---|---:|---:|---:|---:|---:|---|
   | all | 104 | 1.000x | 0.905x | 0.811x | -10.4% | 59 improved / 13 regressed / 32 no-change |
   | selectivity 0.001 | 26 | 1.000x | 0.769x | 0.584x | -24.1% | 23 improved / 0 regressed / 3 no-change |
   | selectivity 0.01 | 26 | 1.000x | 0.885x | 0.755x | -14.7% | 22 improved / 0 regressed / 4 no-change |
   | selectivity 0.1 | 26 | 1.000x | 0.987x | 0.969x | -1.8% | 10 improved / 4 regressed / 12 no-change |
   | selectivity 0.8 | 26 | 1.000x | 0.998x | 1.011x | +1.4% | 4 improved / 9 regressed / 13 no-change |
   
   Breakdown by null density and selectivity:
   
   | Nulls / selectivity | Cases | Baseline | Previous | Current | Current vs previous | Improved / regressed / no-change |
   |---|---:|---:|---:|---:|---:|---|
   | nulls 0, sel 0.001 | 13 | 1.000x | 0.730x | 0.536x | -26.6% | 13 / 0 / 0 |
   | nulls 0, sel 0.01 | 13 | 1.000x | 0.852x | 0.710x | -16.7% | 11 / 0 / 2 |
   | nulls 0, sel 0.1 | 13 | 1.000x | 0.979x | 0.951x | -2.9% | 6 / 1 / 6 |
   | nulls 0, sel 0.8 | 13 | 1.000x | 0.990x | 0.998x | +0.8% | 3 / 4 / 6 |
   | nulls 0.1, sel 0.001 | 13 | 1.000x | 0.810x | 0.635x | -21.6% | 10 / 0 / 3 |
   | nulls 0.1, sel 0.01 | 13 | 1.000x | 0.920x | 0.802x | -12.8% | 11 / 0 / 2 |
   | nulls 0.1, sel 0.1 | 13 | 1.000x | 0.995x | 0.989x | -0.6% | 4 / 3 / 6 |
   | nulls 0.1, sel 0.8 | 13 | 1.000x | 1.006x | 1.025x | +1.9% | 1 / 5 / 7 |
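
The improved / regressed / no-change counts can be thought of as a three-way classification of each case's confidence interval for the relative change in mean time. The sketch below is an assumption about the general shape of that rule, not Criterion's actual code: the `Verdict`, `ChangeInterval`, and `classify` names are hypothetical, the 1% noise threshold is an assumed setting, and the sketch leaves out the separate p-value check that Criterion also applies.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Improved,
    Regressed,
    NoChange,
}

/// Confidence interval for the relative change in mean time
/// (current vs previous), as fractions: -0.10 means 10% faster.
struct ChangeInterval {
    lower: f64,
    upper: f64,
}

/// Classify a case: the whole interval must lie beyond the noise
/// threshold on one side to count as a change (assumed rule).
fn classify(ci: &ChangeInterval, noise_threshold: f64) -> Verdict {
    if ci.upper < -noise_threshold {
        Verdict::Improved
    } else if ci.lower > noise_threshold {
        Verdict::Regressed
    } else {
        Verdict::NoChange
    }
}

fn main() {
    // Hypothetical intervals, not taken from the benchmark run.
    let cases = [
        ChangeInterval { lower: -0.30, upper: -0.20 },
        ChangeInterval { lower: 0.05, upper: 0.12 },
        ChangeInterval { lower: -0.02, upper: 0.03 },
    ];
    let noise = 0.01; // assumed noise threshold of 1%

    let (mut improved, mut regressed, mut unchanged) = (0, 0, 0);
    for ci in &cases {
        match classify(ci, noise) {
            Verdict::Improved => improved += 1,
            Verdict::Regressed => regressed += 1,
            Verdict::NoChange => unchanged += 1,
        }
    }
    println!("{improved} improved / {regressed} regressed / {unchanged} no-change");
}
```

Under a rule like this, a case whose interval straddles zero or stays inside the noise band counts as no-change even if its point estimate moved, which is why the counts can disagree with the sign of a small geomean change.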
   
   Largest current-vs-previous improvements:
   
   | Case | Current vs previous |
   |---|---:|
   | primitive, nulls 0, selectivity 0.001 | -67.1% |
   | primitive, nulls 0.1, selectivity 0.001 | -51.2% |
   | primitive, nulls 0, selectivity 0.01 | -47.2% |
   | mixed_utf8view max_len=20, nulls 0, selectivity 0.001 | -41.0% |
   | mixed_binaryview max_len=20, nulls 0, selectivity 0.001 | -40.2% |
   
   Largest current-vs-previous regressions:
   
   | Case | Current vs previous | Current vs baseline |
   |---|---:|---:|
   | single_utf8view, nulls 0, selectivity 0.8 | +21.9% | +0.8% |
   | single_utf8view, nulls 0.1, selectivity 0.8 | +17.1% | +3.7% |
   | primitive, nulls 0, selectivity 0.8 | +9.0% | -0.3% |
   | mixed_binaryview max_len=8, nulls 0, selectivity 0.8 | +8.9% | -0.1% |
   | mixed_binaryview max_len=128, nulls 0.1, selectivity 0.8 | +7.0% | +8.4% |
   
   Summary: the current per-column implementation is materially faster for sparse filters (geomean -24.1% at selectivity 0.001 and -14.7% at selectivity 0.01 versus the previous PR implementation) and is faster overall than both `apache/main` and the previous PR implementation on this benchmark set (0.811x vs 0.905x relative to baseline). The remaining regressions are concentrated in dense, high-selectivity cases (selectivity 0.8), especially some `Utf8View` / `BinaryView` cases. I left dense-filter-specific tuning out of this PR and plan to handle it separately.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

