Re: [PR] fix: synchronize partition bounds reporting in HashJoin [datafusion]

via GitHub Tue, 03 Feb 2026 09:07:06 -0800


adriangb commented on PR #17452:
URL: https://github.com/apache/datafusion/pull/17452#issuecomment-3842543791


   @Dandandan could you explain your intuition behind this change introducing 
regressions (I'm not saying it didn't, we have to benchmark to confirm)?
   
   Our intuition for this not introducing performance regressions is that all 
build side partitions should finish at roughly the same time, since the 
distribution of data among them is random if the join key is also random. This 
would not be the case if e.g. there are 2 join keys, one going into each of 2 
partitions, and one has a lot more rows than the other in the build side. But 
then I think the negative impact on overall query performance would only happen 
when the probe side data sizes are flipped, i.e. part1 has a small build side 
and large probe side, and part2 is the opposite.
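
   To make that intuition concrete, here is a rough toy model, not DataFusion 
code. It assumes one task per partition, that each partition builds and then 
probes, and that "synchronized" means no partition starts probing until every 
build has finished. The function names and the time units are made up for 
illustration.

```rust
/// Toy cost model (hypothetical, not DataFusion's implementation).
/// `build[i]` and `probe[i]` are the build and probe times of partition `i`,
/// in arbitrary units; both slices are assumed to have the same length.

/// Finish time when each partition starts probing as soon as its own build is done.
fn finish_unsynchronized(build: &[u64], probe: &[u64]) -> u64 {
    build
        .iter()
        .zip(probe)
        .map(|(b, p)| b + p)
        .max()
        .unwrap_or(0)
}

/// Finish time when no partition probes until every build has finished.
fn finish_synchronized(build: &[u64], probe: &[u64]) -> u64 {
    let slowest_build = build.iter().copied().max().unwrap_or(0);
    let slowest_probe = probe.iter().copied().max().unwrap_or(0);
    slowest_build + slowest_probe
}
```

   Under this model, balanced partitions (build `[5, 5]`, probe `[5, 5]`) 
finish at 10 either way, and partitions skewed in the same direction (build 
`[1, 10]`, probe `[1, 10]`) finish at 20 either way. Only the flipped case 
(build `[1, 10]`, probe `[10, 1]`) gets slower: 11 unsynchronized vs 20 
synchronized.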


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


