Re: [I] Improve performance of TPC-DS q72 [datafusion-comet]

via GitHub Wed, 03 Jul 2024 13:49:09 -0700


andygrove commented on issue #622:
URL: https://github.com/apache/datafusion-comet/issues/622#issuecomment-2207244670


   > Spark produces the worst possible query plan for q72
   
   Yes, it does. I am comparing like-for-like plans between Spark and Comet 
without any join reordering enabled.
   
   > Irrespective of the plan though, given the same number of input rows are 
the Comet operators also slower than the corresponding Spark operators?
   
   In both cases, Spark is executing the SortMergeJoin, and the join takes 
longer when its inputs come from CometScan/CometFilter/Exchange than when they 
come from the Spark equivalents (with the same number of rows in both cases).
   
   Things I have learned since filing this issue:
   
   - The time reported for the WholeStageCodegen C2R is misleading. It is the 
elapsed duration of the operator, including time spent waiting on its children, 
not the time spent in the operator itself. The reason this takes so long is not 
necessarily the C2R conversion itself but the elapsed time spent retrieving 
data from child operators (such as the AQEShuffleRead).
   - With Comet enabled, AQEShuffleRead coalesces down to fewer partitions 
than with Spark because Comet produces smaller partitions, presumably thanks 
to columnar compression.
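
To illustrate the coalescing point, here is a rough sketch of size-based coalescing. It is a simplified greedy model, not Spark's actual implementation, and the partition sizes are made-up numbers chosen only to show the effect. The 64 MB target is an assumption, standing in for whatever `spark.sql.adaptive.advisoryPartitionSizeInBytes` is set to.

```python
def coalesce_partitions(sizes_bytes, advisory_bytes):
    """Greedily merge adjacent shuffle partitions into groups whose
    total size does not exceed the advisory size (simplified model)."""
    groups = []
    current = []
    current_size = 0
    for size in sizes_bytes:
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
advisory = 64 * MB  # assumed advisory partition size

# Hypothetical shuffle output: same row count, but the Comet shuffle
# writes fewer bytes per partition than the Spark shuffle.
spark_sizes = [32 * MB] * 200
comet_sizes = [8 * MB] * 200

spark_parts = len(coalesce_partitions(spark_sizes, advisory))
comet_parts = len(coalesce_partitions(comet_sizes, advisory))
print(spark_parts, comet_parts)
```

With these made-up sizes the smaller partitions coalesce into far fewer groups, so each downstream task processes more rows even though the total row count is unchanged.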


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

