Re: [PR] Project record batches to avoid filtering unused columns [datafusion]

via GitHub Tue, 28 Oct 2025 10:57:34 -0700


pepijnve commented on PR #18329:
URL: https://github.com/apache/datafusion/pull/18329#issuecomment-3457779133


   Benchmark results so far. I'll do another run that includes all the lookup
   table benchmarks, but those take much longer to complete.
   
   <details>
   
   ```
   case_when 8192x3: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END
                           time:   [44.119 µs 44.159 µs 44.201 µs]
                           change: [-1.3187% -1.0059% -0.6888%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END
                           time:   [14.092 µs 14.139 µs 14.183 µs]
                           change: [-0.4325% +0.0691% +0.5941%] (p = 0.80 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END
                           time:   [1.6137 µs 1.6248 µs 1.6454 µs]
                           change: [-0.3208% +0.2510% +0.9935%] (p = 0.62 > 0.05)
                           No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high severe
   
   case_when 8192x3: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END
                           time:   [15.638 µs 15.706 µs 15.777 µs]
                           change: [-1.6629% -0.4705% +0.4682%] (p = 0.44 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   case_when 8192x3: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 EN...
                           time:   [18.497 ms 18.584 ms 18.672 ms]
                           change: [-63.275% -63.002% -62.739%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   case_when 8192x3: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n...
                           time:   [77.671 µs 77.743 µs 77.814 µs]
                           change: [-32.757% -31.951% -31.340%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   case_when 8192x3: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END
                           time:   [25.772 ms 25.896 ms 26.027 ms]
                           change: [-59.464% -59.166% -58.882%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     6 (6.00%) high mild
   
   case_when 8192x3: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END
                           time:   [80.644 µs 80.829 µs 81.013 µs]
                           change: [-29.245% -29.042% -28.841%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   case_when 8192x50: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END
                           time:   [44.208 µs 44.255 µs 44.306 µs]
                           change: [-6.2334% -4.3157% -2.6620%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END
                           time:   [12.950 µs 13.125 µs 13.308 µs]
                           change: [-77.208% -76.935% -76.643%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END
                           time:   [1.6336 µs 1.6372 µs 1.6413 µs]
                           change: [+0.5128% +0.8175% +1.1256%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) low mild
   
   case_when 8192x50: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END
                           time:   [15.829 µs 15.917 µs 16.039 µs]
                           change: [-74.636% -74.248% -73.752%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   case_when 8192x50: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 E...
                           time:   [18.634 ms 18.835 ms 19.124 ms]
                           change: [-93.215% -93.103% -92.966%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   case_when 8192x50: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE ...
                           time:   [78.047 µs 78.193 µs 78.340 µs]
                           change: [-84.852% -84.791% -84.733%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   case_when 8192x50: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END
                           time:   [26.130 ms 26.260 ms 26.395 ms]
                           change: [-91.275% -91.142% -91.017%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   case_when 8192x50: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END
                           time:   [79.469 µs 79.961 µs 80.443 µs]
                           change: [-84.462% -84.371% -84.290%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   case_when 8192x100: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END
                           time:   [44.229 µs 44.282 µs 44.339 µs]
                           change: [-0.2623% -0.0773% +0.1124%] (p = 0.45 > 0.05)
                           No change in performance detected.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END
                           time:   [12.831 µs 13.058 µs 13.281 µs]
                           change: [-88.649% -88.422% -88.188%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END
                           time:   [1.6280 µs 1.6328 µs 1.6379 µs]
                           change: [+0.3549% +0.6314% +0.9178%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   case_when 8192x100: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END
                           time:   [15.816 µs 15.874 µs 15.926 µs]
                           change: [-86.013% -85.925% -85.845%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 14 outliers among 100 measurements (14.00%)
     7 (7.00%) low severe
     5 (5.00%) low mild
     2 (2.00%) high mild
   
   case_when 8192x100: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 ...
                           time:   [18.786 ms 18.899 ms 19.039 ms]
                           change: [-96.725% -96.693% -96.662%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   case_when 8192x100: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE...
                           time:   [77.981 µs 78.081 µs 78.187 µs]
                           change: [-91.952% -91.871% -91.800%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) low mild
     3 (3.00%) high severe
   
   case_when 8192x100: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END
                           time:   [25.685 ms 25.783 ms 25.887 ms]
                           change: [-95.974% -95.893% -95.817%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     8 (8.00%) high mild
   
   case_when 8192x100: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END
                           time:   [79.641 µs 79.901 µs 80.183 µs]
                           change: [-91.560% -91.513% -91.470%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   ```
   
   </details>
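
A note on reading the `change` lines: Criterion reports the relative change in measured time, so a change of c percent corresponds to a speedup factor of 1 / (1 + c/100). The following is a minimal Python sketch of that arithmetic, using three point estimates copied from the results above. The helper name `speedup_from_change` is purely illustrative and is not part of Criterion or DataFusion.

```python
def speedup_from_change(change_pct: float) -> float:
    """Convert a relative time change in percent into a speedup factor.

    A change of -50% means the new time is half the old time, i.e. 2x faster.
    """
    new_over_old = 1.0 + change_pct / 100.0
    if new_over_old <= 0.0:
        raise ValueError("change must be greater than -100%")
    return 1.0 / new_over_old


# Point estimates (middle value of each `change` interval) from the output above.
samples = [
    ("8192x3   CASE WHEN c1 == 0 ...", -63.002),
    ("8192x50  CASE WHEN c1 == 0 ...", -93.103),
    ("8192x100 CASE WHEN c1 == 0 ...", -96.693),
]

for label, change in samples:
    print(f"{label}: {speedup_from_change(change):.1f}x faster")
```

Under this reading, the same expression goes from roughly 2.7x faster with 3 columns to roughly 30x faster with 100 columns.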

