alamb commented on issue #6782:
URL:
https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1767079553
Here is my current analysis of the ClickBench queries:
> Table \ref{table:clickbench-performance} shows query execution time for
the ClickBench queries. DataFusion performs better on queries that have highly
selective predicates such as Q2 and Q8, likely due to its ability to skip
entire row groups based on the predicates. DataFusion also does better for
queries with very few groups, such as Q1, Q3, and Q4, which is likely due to
lower per-file Parquet overhead and more vectorized aggregate updates. For
queries with medium selectivity and medium group cardinality, such as Q15, Q31,
and Q32, the engines are much closer. For queries that have high grouping
cardinality (10M or more), such as Q18, Q19, and Q33, DuckDB performs around 2x
faster, which we believe is due to its highly optimized group by
aggregation\cite{DuckDbParallelGroupAggregation}. Note that the absolute
wall-clock time of these high cardinality queries is also large, so DuckDB's
advantage in that area shows how important this optimization is.
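
To make the row group skipping point concrete, here is a minimal Python sketch of pruning by min/max statistics. It is only an illustration of the general idea, not DataFusion's actual implementation: the `RowGroupStats` class, the `prune_row_groups` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for a single column."""
    min_value: int
    max_value: int
    num_rows: int


def prune_row_groups(stats, lo, hi):
    """Return indices of row groups whose [min, max] range overlaps [lo, hi].

    Row groups whose range does not overlap the predicate range cannot
    contain matching rows, so they can be skipped without being decoded.
    """
    return [
        i for i, s in enumerate(stats)
        if s.max_value >= lo and s.min_value <= hi
    ]


# Made-up statistics for three row groups of 1000 rows each.
row_groups = [
    RowGroupStats(min_value=0, max_value=99, num_rows=1000),
    RowGroupStats(min_value=100, max_value=199, num_rows=1000),
    RowGroupStats(min_value=200, max_value=299, num_rows=1000),
]

# A selective predicate such as `col BETWEEN 150 AND 160` touches one group.
kept = prune_row_groups(row_groups, 150, 160)
rows_scanned = sum(row_groups[i].num_rows for i in kept)
```

With a highly selective predicate only one of the three row groups has to be read, which is the kind of saving the paragraph attributes to Q2 and Q8.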