[I] [EPIC] Make DataFusion the top of the ClickBench Parquet leaderboard [datafusion]

via GitHub Tue, 04 Nov 2025 14:55:12 -0800


alamb opened a new issue, #18489:
URL: https://github.com/apache/datafusion/issues/18489


   ### Is your feature request related to a problem or challenge?
   
   The [ClickBench Benchmark](https://benchmark.clickhouse.com/) measures the 
performance of filtering and aggregation, two of the core 
   
   Being on top of ClickBench is somewhat of a vanity metric: in my opinion 
all the engines within a factor of 2 likely offer similar user experiences 
(and the exact speed will depend on real user queries, etc)
   
   That being said, the engine at the top of the benchmark is good for 
publicity and the DataFusion community is certainly not against using it as 
such (see our blog here [Apache DataFusion is now the fastest single node 
engine for querying Apache Parquet 
files](https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/))
   
   Also, ClickBench has more recently added more realistic benchmark machines (
   
   This ticket tracks improving the ClickBench performance even more
   
   Here is where we stand with DataFusion 50 on the benchmark
   (TODO: @pmcgleenon is running over the next few days, see 
https://github.com/apache/datafusion/issues/17721#issuecomment-3488229699 -- 
and then I will update)
   
   ### Describe the solution you'd like
   
   Get DataFusion back on top of ClickBench for reading partitioned parquet
   
   While being at the absolute top might seem appealing, I think whatever it 
takes to get there is likely not general purpose enough
   
   ### Describe alternatives you've considered
   
   While we could clearly implement ClickBench-specific optimizations, I don't 
think that is really a valuable exercise for users. I would very much like to 
focus our efforts on actually useful optimizations -- if someone wants to go 
nuts with Benchmaxxing, check out
   - https://github.com/apache/datafusion/issues/13448
   
   Real Improvements
   - [ ] https://github.com/apache/datafusion/issues/3463
   
   Potential Benchmaxxing (only really helps ClickBench) improvements
   - [ ] https://github.com/apache/datafusion/issues/15524
   - [ ] 
   
   
   What I would like is for people to profile queries and try to find ways to 
improve them
   
   ### Additional context
   
   See related discussions on
   - https://github.com/apache/datafusion/issues/14586
   


