Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Wed, 31 Jul 2024 03:36:22 -0700


alamb commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2260206257


   Thank you @jayzhan211  -- those are some interesting results. 
   
   I think it makes sense that reusing the hash values helps mostly for 
high cardinality aggregates, since in that case the number of rows that need to 
be repartitioned / rehashed is high.
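   
   To illustrate the idea, here is a minimal sketch using only the Rust 
standard library. It is not DataFusion's actual code: `PartialRow`, `hash_key`, 
`repartition`, and `final_aggregate` are hypothetical names, and a string key 
with a count stands in for real group columns and accumulators. The hash is 
computed once and then carried along, so neither the repartition step nor the 
final grouping step has to hash the key again.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a group key once (hypothetical helper, not a DataFusion API).
fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Output of the partial aggregation stage: the group key, the hash that
/// was already computed for it, and a partial count.
struct PartialRow {
    key: String,
    hash: u64,
    count: u64,
}

/// Repartition rows using the carried hash instead of rehashing the key.
fn repartition(rows: Vec<PartialRow>, num_partitions: usize) -> Vec<Vec<PartialRow>> {
    let mut parts: Vec<Vec<PartialRow>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for row in rows {
        let p = (row.hash % num_partitions as u64) as usize;
        parts[p].push(row);
    }
    parts
}

/// Final aggregation for one partition. Buckets are selected from the upper
/// bits of the carried hash, so the key itself is never hashed again.
fn final_aggregate(part: Vec<PartialRow>, num_buckets: usize) -> Vec<(String, u64)> {
    let mut buckets: Vec<Vec<(String, u64)>> = (0..num_buckets).map(|_| Vec::new()).collect();
    for row in part {
        let b = ((row.hash >> 32) % num_buckets as u64) as usize;
        let bucket = &mut buckets[b];
        if let Some(pos) = bucket.iter().position(|(k, _)| *k == row.key) {
            // Key already seen in this bucket: merge the partial count.
            bucket[pos].1 += row.count;
        } else {
            bucket.push((row.key, row.count));
        }
    }
    buckets.into_iter().flatten().collect()
}
```

With low cardinality, the partial stage collapses most rows before they reach 
`repartition`, so few hashes are saved; with high cardinality nearly every 
input row survives to the later stages, which is where carrying the hash pays 
off.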
   
   > Alternative idea for improvement is, if we can combine partial group + 
repartition + final group in one operation. We could probably avoid converting 
to row once again in final group.
   
   I think this is the approach taken by systems like DuckDB, as I understand 
it, and I think it is quite intriguing to consider.
   
   The challenge of the approach would be the software engineering required to 
manage the complexity of the combined multi-stage operator. I am not sure the 
functionality would be easy to combine without some more refactoring 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

