Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063754335

> > is it using a hash table or open addressing (df doesn't have the latter)
>
> [@XiangpengHao](https://github.com/XiangpengHao) has mentioned several times that we think DuckDB uses radix trees (which work like hash tables but save the expensive hash).
>
> He has an implementation here https://github.com/XiangpengHao/congee
>
> Maybe we could investigate using that structure rather than a hash table
>
> > is the hash-repartition taking considerable time (RepartitionExec, which involves 2 copies of the input) - what are the other engines doing?
>
> Not only is it copying the values twice it _hash_es them twice which can be quite expensive

Ah nice, wasn't aware. That seems a nice direction!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org