alamb opened a new issue, #18489: URL: https://github.com/apache/datafusion/issues/18489
### Is your feature request related to a problem or challenge?

The [ClickBench Benchmark](https://benchmark.clickhouse.com/) measures the performance of filtering and aggregation, two of the core

Being on top of ClickBench is somewhat of a vanity benchmark: in my opinion all the engines within a factor of 2 of likely have similar user experiences (and the exact speed will depend on real user queries, etc.)

That being said, being the engine at the top of the benchmark is good for publicity, and the DataFusion community is certainly not against using it as such (see our blog post [Apache DataFusion is now the fastest single node engine for querying Apache Parquet files](https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/))

Also, ClickBench has more recently added more realistic benchmark machines (

This ticket tracks improving the ClickBench performance even more.

Here is where we stand with DataFusion 50 on the benchmark (TODO: @pmcgleenon is running over the next few days, see https://github.com/apache/datafusion/issues/17721#issuecomment-3488229699 -- and then I will update)

### Describe the solution you'd like

Get DataFusion back on top of ClickBench for reading partitioned parquet.

While being at the absolute top might seem appealing, I think it is likely not general purpose enough.

### Describe alternatives you've considered

While we could clearly implement ClickBench specific optimizations, I don't think that is really a valuable exercise for users.
I would very much like to focus our efforts on actually useful optimizations -- if someone wants to go nuts with BenchMaxxing, check out
- https://github.com/apache/datafusion/issues/13448

Real Improvements
- [ ] https://github.com/apache/datafusion/issues/3463

Potential Benchmaxxing (only really helps ClickBench) improvements
- [ ] https://github.com/apache/datafusion/issues/15524
- [ ]

What I would like is for people to profile queries and try to find ways to improve them.

### Additional context

See related discussions on
- https://github.com/apache/datafusion/issues/14586

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
