alamb commented on issue #6782:
URL: 
https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1767079553

   Here is my current analysis of the ClickBench queries:
   
   > Table \ref{table:clickbench-performance} shows query execution time for 
the ClickBench queries. DataFusion performs better on queries that have highly 
selective predicates, such as Q2 and Q8, likely due to its ability to skip 
entire row groups based on the predicates. DataFusion also does better on 
queries with very few groups, such as Q1, Q3, and Q4, which is likely due to 
lower per-Parquet-file overhead and more vectorized aggregate updates. For 
queries with medium selectivity and medium group cardinality, such as Q15, Q31, 
and Q32, the engines are much closer. For queries that have high grouping 
cardinality (10M or more), such as Q18, Q19, and Q33, DuckDB performs around 2x 
faster, which we believe is due to its highly optimized group by 
aggregation\cite{DuckDbParallelGroupAggregation}. Note that the absolute 
wall-clock time of these high-cardinality queries is also large, so DuckDB's 
performance in that area shows how important this optimization is.
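
   As a rough illustration of the row group skipping mentioned in the quoted paragraph, here is a minimal Python sketch. It is not DataFusion's implementation: `RowGroupStats`, `may_contain_eq`, and `prune_row_groups` are hypothetical names, and the sketch assumes each row group carries min/max statistics for the filtered column and that the predicate is a simple equality.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one column of one row group."""
    min_value: int
    max_value: int
    num_rows: int


def may_contain_eq(stats: RowGroupStats, literal: int) -> bool:
    """Return False only when `column = literal` cannot match any row."""
    return stats.min_value <= literal <= stats.max_value


def prune_row_groups(groups: List[RowGroupStats], literal: int) -> List[int]:
    """Return the indexes of the row groups that still have to be scanned."""
    return [i for i, g in enumerate(groups) if may_contain_eq(g, literal)]


# Three row groups with disjoint value ranges (illustrative data only).
groups = [
    RowGroupStats(min_value=0, max_value=99, num_rows=1000),
    RowGroupStats(min_value=100, max_value=199, num_rows=1000),
    RowGroupStats(min_value=200, max_value=299, num_rows=1000),
]

# A highly selective predicate leaves only one row group to read.
kept = prune_row_groups(groups, 150)
```

   The more selective the predicate, the fewer row groups survive the statistics check, which is consistent with the advantage on queries like Q2 and Q8 being attributed to pruning.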
   

