alamb commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063749417
> is it using a hash table or open addressing (df doesn't have the latter)

@XiangpengHao has mentioned several times that we think DuckDB uses radix trees (which work like hash tables but save the expensive hash). He has an implementation here: https://github.com/XiangpengHao/congee

Maybe we could investigate using that structure rather than a hash table.

> is the hash-repartition taking considerable time (RepartitionExec, which involves 2 copies of the input) - what are the other engines doing?

Not only is it copying the values twice, it *hash*es them twice, which can be quite expensive.
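
To make the radix-tree idea concrete, here is a minimal sketch of a group-by counter that indexes directly on the key's bytes, so no hash function is ever evaluated. Everything in it (`Node`, `RadixCounter`, the fixed 256-way fanout, `u32` keys) is a hypothetical illustration; it is not congee's API and not a claim about how DuckDB implements it. A production structure would use adaptive node sizes rather than allocating 256 slots per node.

```rust
/// One level of the tree. Each node has 256 slots, one per possible key byte.
/// Hypothetical illustration only; not taken from congee or DataFusion.
struct Node {
    children: Vec<Option<Box<Node>>>,
    /// Number of times the key ending at this node was seen (used at leaf depth).
    count: u64,
}

impl Node {
    fn new() -> Self {
        Node {
            children: (0..256).map(|_| None).collect(),
            count: 0,
        }
    }
}

/// Counts occurrences of u32 keys by walking the key's four bytes,
/// most significant first. No hashing is involved.
pub struct RadixCounter {
    root: Node,
}

impl RadixCounter {
    pub fn new() -> Self {
        RadixCounter { root: Node::new() }
    }

    /// Increment the count for `key`, creating nodes along the path as needed.
    pub fn increment(&mut self, key: u32) {
        let mut node = &mut self.root;
        for byte in key.to_be_bytes() {
            let child = node.children[byte as usize]
                .get_or_insert_with(|| Box::new(Node::new()));
            node = &mut **child;
        }
        node.count += 1;
    }

    /// Return the count for `key`, or 0 if the key was never inserted.
    pub fn get(&self, key: u32) -> u64 {
        let mut node = &self.root;
        for byte in key.to_be_bytes() {
            match node.children[byte as usize].as_deref() {
                Some(child) => node = child,
                None => return 0,
            }
        }
        node.count
    }
}
```

Each lookup costs a fixed four pointer hops for a `u32` key. Whether that beats hashing plus a probe depends on key width and cache behavior, which is what an investigation would need to measure.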