alamb commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063749417

   > is it using a hash table or open addressing (df doesn't have the latter)
   
   
   @XiangpengHao has mentioned several times that, as far as we can tell, DuckDB uses radix trees (which fill the same role as hash tables but skip the expensive hash computation).
   
   He has an implementation here:
   https://github.com/XiangpengHao/congee
   
   Maybe we could investigate using that structure rather than a hash table.
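
To make the "no hash" point concrete, here is a minimal sketch of a fixed-stride radix tree over `u32` keys. It is a toy written for illustration only: `RadixMap` and `Node` are hypothetical names, it is not congee's API, and it is not adaptive or concurrent. The key's own bits select the path through the tree, so no hash function is ever called.

```rust
/// Toy radix tree node: 16 children, one per 4-bit slice of the key.
#[derive(Default)]
struct Node {
    children: [Option<Box<Node>>; 16],
    value: Option<u64>,
}

/// Hypothetical map type for illustration; keys are u32, values are u64.
#[derive(Default)]
struct RadixMap {
    root: Node,
}

impl RadixMap {
    /// Insert a value, returning the previous value for this key if any.
    fn insert(&mut self, key: u32, value: u64) -> Option<u64> {
        let mut node = &mut self.root;
        // Walk 8 levels, most significant nibble first, creating nodes as needed.
        for level in (0..8).rev() {
            let nibble = ((key >> (level * 4)) & 0xF) as usize;
            node = &mut **node.children[nibble].get_or_insert_with(Box::default);
        }
        node.value.replace(value)
    }

    /// Look up a key by following its nibbles; returns None on a missing branch.
    fn get(&self, key: u32) -> Option<u64> {
        let mut node = &self.root;
        for level in (0..8).rev() {
            let nibble = ((key >> (level * 4)) & 0xF) as usize;
            node = node.children[nibble].as_deref()?;
        }
        node.value
    }
}
```

The trade-off is that lookup cost depends on key length (here always 8 pointer hops) rather than on a hash computation plus probing. Adaptive variants shrink the node sizes and compress paths to reduce that cost.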
   
   > is the hash-repartition taking considerable time (RepartitionExec, which involves 2 copies of the input) - what are the other engines doing?
   
   Not only does it copy the values twice, it also *hash*es them twice, which can be quite expensive.
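
One way to picture avoiding the second hash is to carry the hash computed during repartitioning along with each row, so the downstream aggregation can reuse it. The sketch below is an assumption-laden illustration using only the Rust standard library. `hash_once`, `repartition`, and `count_partition` are hypothetical helpers, not DataFusion APIs, and this is not how `RepartitionExec` is implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a key once; both stages below reuse this value.
fn hash_once(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Stage 1: route each key to a partition and keep its hash next to it.
fn repartition<'a>(keys: &[&'a str], n: usize) -> Vec<Vec<(u64, &'a str)>> {
    let mut parts = vec![Vec::new(); n];
    for &k in keys {
        let h = hash_once(k);
        parts[(h % n as u64) as usize].push((h, k));
    }
    parts
}

/// Stage 2: count occurrences per key inside one partition.
/// The carried hash picks the bucket, so `hash_once` is not called again.
fn count_partition<'a>(rows: &[(u64, &'a str)], buckets: usize) -> Vec<(&'a str, usize)> {
    let mut table: Vec<Vec<(u64, &'a str, usize)>> = vec![Vec::new(); buckets];
    for &(h, k) in rows {
        // Use the high bits so the bucket choice differs from the partition choice.
        let bucket = &mut table[((h >> 32) % buckets as u64) as usize];
        if let Some(i) = bucket.iter().position(|e| e.0 == h && e.1 == k) {
            bucket[i].2 += 1;
        } else {
            bucket.push((h, k, 1));
        }
    }
    table.into_iter().flatten().map(|(_, k, c)| (k, c)).collect()
}
```

Carrying the hash costs 8 extra bytes per row, which may or may not pay off depending on how expensive the key hash is.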


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

