cetra3 opened a new pull request, #20381:
URL: https://github.com/apache/datafusion/pull/20381

   ## Which issue does this PR close?
   
   Partially addresses https://github.com/apache/datafusion/issues/19386
   
   ## Rationale for this change
   
   Compaction of TopK values currently uses hard-coded heuristics to decide when to compact. Instead, we can use the memory size of the retained batches as the bound.
   
   ## What changes are included in this PR?
   
   Adjusts TopK compaction to compact more aggressively, based on the memory size of the retained batches; a sketch of the idea is shown below.
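   
   For illustration, here is a minimal, self-contained sketch of the memory-bound idea. The struct and method names (`BatchStore`, `should_compact`, etc.) are hypothetical and are not DataFusion's actual TopK internals; the point is only that compaction triggers when the memory retained by stored batches greatly exceeds the memory the live top-k rows actually need, rather than after a fixed batch-count heuristic.
   
   ```rust
   /// Hypothetical stand-in for a batch retained by the TopK heap
   /// (not DataFusion's real types).
   struct StoredBatch {
       /// Bytes retained by this batch.
       mem_size: usize,
       /// Rows of this batch still referenced by the top-k heap.
       live_rows: usize,
       /// Total rows in the batch.
       total_rows: usize,
   }
   
   struct BatchStore {
       batches: Vec<StoredBatch>,
   }
   
   impl BatchStore {
       /// Memory currently retained by all stored batches.
       fn retained_bytes(&self) -> usize {
           self.batches.iter().map(|b| b.mem_size).sum()
       }
   
       /// Estimate of the memory needed for the rows the heap still references,
       /// assuming rows within a batch are roughly equal in size.
       fn live_bytes(&self) -> usize {
           self.batches
               .iter()
               .map(|b| {
                   if b.total_rows == 0 {
                       0
                   } else {
                       b.mem_size * b.live_rows / b.total_rows
                   }
               })
               .sum()
       }
   
       /// Compact once retained memory exceeds `max_ratio` times the live memory,
       /// i.e. when most of the held bytes belong to rows no longer in the top-k.
       fn should_compact(&self, max_ratio: usize) -> bool {
           self.retained_bytes() > self.live_bytes().saturating_mul(max_ratio)
       }
   }
   
   fn main() {
       let store = BatchStore {
           batches: vec![
               StoredBatch { mem_size: 8192, live_rows: 2, total_rows: 1024 },
               StoredBatch { mem_size: 8192, live_rows: 1, total_rows: 1024 },
           ],
       };
       // Only 3 of 2048 rows are still live, so almost all retained memory is
       // dead weight and compaction is triggered even at a generous ratio.
       assert!(store.should_compact(4));
       println!(
           "retained = {} bytes, live ~= {} bytes",
           store.retained_bytes(),
           store.live_bytes()
       );
   }
   ```
   
   The ratio and the per-batch bookkeeping above are illustrative only; the PR wires the equivalent memory-size check into the existing TopK compaction path.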
   
   ## Are these changes tested?
   
   Yes, and a test has been added.
   
   ## Are there any user-facing changes?
   
   No
   
   ## Benchmarks
   
   I'm struggling (in this PR and other PRs...) to get reliable benchmark runs through.
   
   However, with this one it seems to be *mostly* the same speed as `main`, or a little faster in some cases:
   
   ```
   + critcmp main topk_memory_batch
   group                                                             main                                   topk_memory_batch
   -----                                                             ----                                   -----------------
   aggregate 10000000 time-series rows                               1.05     45.5±2.13ms        ? ?/sec    1.00     43.5±2.09ms        ? ?/sec
   aggregate 10000000 worst-case rows                                1.06     43.9±1.12ms        ? ?/sec    1.00     41.5±1.03ms        ? ?/sec
   distinct 10000000 rows asc [TopK]                                 1.00      4.3±0.25ms        ? ?/sec    1.00      4.3±0.07ms        ? ?/sec
   distinct 10000000 rows asc [no TopK]                              1.00     42.9±1.83ms        ? ?/sec    1.06     45.5±1.65ms        ? ?/sec
   distinct 10000000 rows desc [TopK]                                1.00      4.3±0.07ms        ? ?/sec    1.01      4.3±0.06ms        ? ?/sec
   distinct 10000000 rows desc [no TopK]                             1.00     42.4±1.48ms        ? ?/sec    1.05     44.7±1.88ms        ? ?/sec
   top k=10 aggregate 10000000 time-series rows                      1.13     10.6±0.54ms        ? ?/sec    1.00      9.4±0.64ms        ? ?/sec
   top k=10 aggregate 10000000 time-series rows [Utf8View]           1.08     11.0±0.65ms        ? ?/sec    1.00     10.2±0.48ms        ? ?/sec
   top k=10 aggregate 10000000 worst-case rows                       1.04     16.8±1.47ms        ? ?/sec    1.00     16.1±1.44ms        ? ?/sec
   top k=10 aggregate 10000000 worst-case rows [Utf8View]            1.10     17.9±1.51ms        ? ?/sec    1.00     16.2±1.16ms        ? ?/sec
   top k=10 string aggregate 10000000 time-series rows [Utf8View]    1.11      9.1±0.36ms        ? ?/sec    1.00      8.2±0.35ms        ? ?/sec
   top k=10 string aggregate 10000000 time-series rows [Utf8]        1.12      7.8±0.48ms        ? ?/sec    1.00      6.9±0.26ms        ? ?/sec
   top k=10 string aggregate 10000000 worst-case rows [Utf8View]     1.06      8.6±0.21ms        ? ?/sec    1.00      8.1±0.20ms        ? ?/sec
   top k=10 string aggregate 10000000 worst-case rows [Utf8]         1.07      7.4±0.17ms        ? ?/sec    1.00      6.9±0.13ms        ? ?/sec
   
   ```

