akoshchiy commented on issue #15607:
URL: https://github.com/apache/datafusion/issues/15607#issuecomment-3801609515
I've been working on range calculate optimization and have some results in
#20014. It uses the arrow's `DynComparator` for comparing array elements and
arrow's kernels for precomputing bounds with deltas.
Here is a benchmark results on my pc (ryzen 9900x, ubuntu 25.10)
```
# before
./target/release-nonlto/deps/window_query_sql-cfc4a16626956623 --bench
--noplot
Gnuplot not found, using plotters backend
window empty over, aggregate functions
time: [6.5153 ms 6.5321 ms 6.5511 ms]
Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) high mild
9 (9.00%) high severe
Benchmarking window empty over, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 12.8s, or reduce sample count to 30.
window empty over, built-in functions
time: [127.21 ms 128.60 ms 129.98 ms]
Benchmarking window order by, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 97.7s, or reduce sample count to 10.
window order by, aggregate functions
time: [984.48 ms 987.61 ms 990.80 ms]
Benchmarking window order by and range offsets, aggregate functions: Warming
up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 371.7s, or reduce sample count to 10.
window order by and range offsets, aggregate functions
time: [3.6569 s 3.6602 s 3.6637 s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking window order by, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 86.9s, or reduce sample count to 10.
window order by, built-in functions
time: [868.42 ms 870.71 ms 873.11 ms]
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high mild
Benchmarking window partition by, u64_wide, aggregate functions: Warming up
for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 36.3s, or reduce sample count to 10.
window partition by, u64_wide, aggregate functions
time: [352.45 ms 354.97 ms 357.51 ms]
window partition by, u64_narrow, aggregate functions
time: [9.8960 ms 9.9682 ms 10.045 ms]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Benchmarking window partition by, u64_wide, built-in functions: Warming up
for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 39.1s, or reduce sample count to 10.
window partition by, u64_wide, built-in functions
time: [388.56 ms 390.62 ms 392.69 ms]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
window partition by, u64_narrow, built-in functions
time: [36.278 ms 36.612 ms 36.945 ms]
Benchmarking window partition and order by, u64_wide, aggregate functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 54.1s, or reduce sample count to 10.
window partition and order by, u64_wide, aggregate functions
time: [532.08 ms 533.75 ms 535.46 ms]
Benchmarking window partition and order by and range offsets, u64_wide,
aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 63.5s, or reduce sample count to 10.
window partition and order by and range offsets, u64_wide, aggregate
functions
time: [636.52 ms 638.74 ms 641.07 ms]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Benchmarking window partition and order by, u64_narrow, aggregate functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 17.0s, or reduce sample count to 20.
window partition and order by, u64_narrow, aggregate functions
time: [168.59 ms 168.99 ms 169.42 ms]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
Benchmarking window partition and order by and range offsets, u64_narrow,
aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 58.4s, or reduce sample count to 10.
window partition and order by and range offsets, u64_narrow, aggregate
functions
time: [572.87 ms 573.66 ms 574.46 ms]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Benchmarking window partition and order by, u64_wide, built-in functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 51.4s, or reduce sample count to 10.
window partition and order by, u64_wide, built-in functions
time: [509.20 ms 511.51 ms 513.82 ms]
Benchmarking window partition and order by, u64_narrow, built-in functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 16.2s, or reduce sample count to 30.
window partition and order by, u64_narrow, built-in functions
time: [161.73 ms 162.18 ms 162.68 ms]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
Benchmarking window partition and order by and range offsets, u64_wide,
aggregate functions #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 64.2s, or reduce sample count to 10.
window partition and order by and range offsets, u64_wide, aggregate
functions #2
time: [635.90 ms 638.16 ms 640.43 ms]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
# after
./target/release-nonlto/deps/window_query_sql-cfc4a16626956623 --bench
--noplot
Gnuplot not found, using plotters backend
window empty over, aggregate functions
time: [6.4957 ms 6.5093 ms 6.5242 ms]
change: [−0.7133% −0.3502% −0.0081%] (p = 0.05 >
0.05)
No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) high mild
11 (11.00%) high severe
Benchmarking window empty over, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 11.6s, or reduce sample count to 40.
window empty over, built-in functions
time: [113.10 ms 113.88 ms 114.68 ms]
change: [−12.647% −11.449% −10.347%] (p = 0.00 <
0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking window order by, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 26.2s, or reduce sample count to 10.
window order by, aggregate functions
time: [257.99 ms 258.92 ms 259.91 ms]
change: [−73.906% −73.783% −73.653%] (p = 0.00 <
0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild
Benchmarking window order by and range offsets, aggregate functions: Warming
up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 57.4s, or reduce sample count to 10.
window order by and range offsets, aggregate functions
time: [556.19 ms 557.88 ms 559.59 ms]
change: [−84.813% −84.758% −84.709%] (p = 0.00 <
0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking window order by, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 15.6s, or reduce sample count to 30.
window order by, built-in functions
time: [154.50 ms 155.14 ms 155.79 ms]
change: [−82.282% −82.183% −82.098%] (p = 0.00 <
0.05)
Performance has improved.
Benchmarking window partition by, u64_wide, aggregate functions: Warming up
for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 35.7s, or reduce sample count to 10.
window partition by, u64_wide, aggregate functions
time: [350.09 ms 352.64 ms 355.17 ms]
change: [−1.6380% −0.6558% +0.3300%] (p = 0.20 >
0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) low mild
window partition by, u64_narrow, aggregate functions
time: [9.8499 ms 9.9108 ms 9.9729 ms]
change: [−1.5361% −0.5754% +0.4139%] (p = 0.25 >
0.05)
No change in performance detected.
Benchmarking window partition by, u64_wide, built-in functions: Warming up
for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 38.6s, or reduce sample count to 10.
window partition by, u64_wide, built-in functions
time: [386.84 ms 389.04 ms 391.20 ms]
change: [−1.1424% −0.4060% +0.4229%] (p = 0.31 >
0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
window partition by, u64_narrow, built-in functions
time: [32.804 ms 33.098 ms 33.394 ms]
change: [−10.649% −9.5960% −8.4033%] (p = 0.00 <
0.05)
Performance has improved.
Benchmarking window partition and order by, u64_wide, aggregate functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 61.0s, or reduce sample count to 10.
window partition and order by, u64_wide, aggregate functions
time: [608.91 ms 610.93 ms 612.95 ms]
change: [+13.925% +14.459% +14.983%] (p = 0.00 <
0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
Benchmarking window partition and order by and range offsets, u64_wide,
aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 79.2s, or reduce sample count to 10.
window partition and order by and range offsets, u64_wide, aggregate
functions
time: [787.09 ms 788.88 ms 790.68 ms]
change: [+22.988% +23.505% +24.037%] (p = 0.00 <
0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
window partition and order by, u64_narrow, aggregate functions
time: [47.286 ms 47.538 ms 47.814 ms]
change: [−72.050% −71.869% −71.694%] (p = 0.00 <
0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Benchmarking window partition and order by and range offsets, u64_narrow,
aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 12.1s, or reduce sample count to 40.
window partition and order by and range offsets, u64_narrow, aggregate
functions
time: [118.38 ms 119.05 ms 119.74 ms]
change: [−79.359% −79.247% −79.103%] (p = 0.00 <
0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
Benchmarking window partition and order by, u64_wide, built-in functions:
Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 61.3s, or reduce sample count to 10.
window partition and order by, u64_wide, built-in functions
time: [612.57 ms 614.95 ms 617.33 ms]
change: [+19.513% +20.224% +20.949%] (p = 0.00 <
0.05)
Performance has regressed.
window partition and order by, u64_narrow, built-in functions
time: [41.364 ms 41.519 ms 41.679 ms]
change: [−74.516% −74.399% −74.275%] (p = 0.00 <
0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking window partition and order by and range offsets, u64_wide,
aggregate functions #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 78.6s, or reduce sample count to 10.
window partition and order by and range offsets, u64_wide, aggregate
functions #2
time: [783.53 ms 785.16 ms 786.79 ms]
change: [+22.528% +23.035% +23.519%] (p = 0.00 <
0.05)
Performance has regressed.
```
So, there is a good performance gain on queries with no partition and
low-cardinality col partitions, but also here is a regression queries on
high-cardinality col partitions because of the small batches. Also there is a
overflow regression because of a lack of saturating operations in arrow kernels.
It's still in progress, now I'm trying to overcome the 20% regression - the
main idea is use a custom array comparator which combines comparing and delta
compute, and which could be built at once.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]