[I] Slow aggregrate query, Polars is 4 times faster for equal query [datafusion]

via GitHub Fri, 05 Sep 2025 11:37:25 -0700


valkum opened a new issue, #17446:
URL: https://github.com/apache/datafusion/issues/17446


   ### Describe the bug
   
   I encountered a slow query, which I already mentioned in Discord but got no 
answer. I went ahead and created a repro case with data I can share that 
showcases this slow query.
   
   Plans:
   ```
   DataFrame()
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                       |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: df.name, df.group, CASE WHEN 
array_element(markets, Int64(1)) IS NOT NULL THEN markets ELSE List() END AS 
CASE WHEN markets[Int64(1)] IS NOT NULL THEN markets ELSE NULL END              
                                            |
   |               |   Aggregate: groupBy=[[df.name, df.group]], 
aggr=[[array_agg(market) AS markets]]                                           
                                                                                
                                      |
   |               |     Projection: df.name, df.group, CASE WHEN df.market IS 
NOT NULL AND df.price IS NOT NULL THEN named_struct(Utf8("market"), 
CAST(df.market AS Dictionary(UInt16, Utf8)), Utf8("price"), df.price) ELSE 
Struct({market:,price:}) END AS market   |
   |               |       TableScan: df projection=[name, group, market, 
price]                                                                          
                                                                                
                             |
   | physical_plan | ProjectionExec: expr=[name@0 as name, group@1 as group, 
CASE WHEN array_element(markets@2, 1) IS NOT NULL THEN markets@2 END as CASE 
WHEN markets[Int64(1)] IS NOT NULL THEN markets ELSE NULL END]                  
                             |
   |               |   AggregateExec: mode=FinalPartitioned, gby=[name@0 as 
name, group@1 as group], aggr=[markets], ordering_mode=Sorted                   
                                                                                
                           |
   |               |     SortExec: expr=[name@0 ASC NULLS LAST, group@1 ASC 
NULLS LAST], preserve_partitioning=[true]                                       
                                                                                
                           |
   |               |       CoalesceBatchesExec: target_batch_size=8192          
                                                                                
                                                                                
                       |
   |               |         RepartitionExec: partitioning=Hash([name@0, 
group@1], 10), input_partitions=10                                              
                                                                                
                              |
   |               |           AggregateExec: mode=Partial, gby=[name@0 as 
name, group@1 as group], aggr=[markets], ordering_mode=Sorted                   
                                                                                
                            |
   |               |             ProjectionExec: expr=[name@0 as name, group@1 
as group, CASE WHEN market@2 IS NOT NULL AND price@3 IS NOT NULL THEN 
named_struct(market, CAST(market@2 AS Dictionary(UInt16, Utf8)), price, 
price@3) END as market]                   |
   |               |               DataSourceExec: file_groups={10 groups: [, 
...]}, projection=[name, group, market, price], output_ordering=[name@0 ASC 
NULLS LAST, group@1 ASC NULLS LAST], file_type=parquet                          
                             |
   |               |                                                            
                                                                                
                                                                                
                       |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   Hyperfine reports:
   ```
   Benchmark 1: uv run test_polars.py sample-1m.parquet
     Time (mean ± σ):     300.2 ms ± 104.0 ms    [User: 538.6 ms, System: 140.1 
ms]
     Range (min … max):   249.2 ms … 585.3 ms    10 runs
    
     Warning: The first benchmarking run for this command was significantly 
slower than the rest (585.3 ms). This could be caused by (filesystem) caches 
that were not filled until after the first run. You should consider using the 
'--warmup' option to fill those caches before the actual benchmark. 
Alternatively, use the '--prepare' option to clear the caches before each 
timing run.
    
   Benchmark 2: uv run test_datafusion.py sample-1m.parquet
     Time (mean ± σ):      1.320 s ±  0.062 s    [User: 4.919 s, System: 0.258 
s]
     Range (min … max):    1.264 s …  1.444 s    10 runs
    
   Summary
     uv run test_polars.py sample-1m.parquet ran
       4.40 ± 1.54 times faster than uv run test_datafusion.py sample-1m.parquet
   ```
   
   
   ### To Reproduce
   
   See https://github.com/valkum/polars-datafusion-comparison/tree/main for 
full repro case.
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   I am not sure if a similar file and task is part of the benchmarks. As the 
1m sample file is 14MB in size, thus it makes no sense to include such a file 
in the git repo. A smaller sample file (e.g. 10k) only results a 1.49x 
difference by Polars. Feel free to store a generated file to include this in 
the benchmarks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Slow aggregrate query, Polars is 4 times faster for equal query [datafusion]

Reply via email to