Re: [PR] feat: selectivity metrics (for Explain Analyze) in Hash Join [datafusion]

via GitHub Wed, 05 Nov 2025 01:38:30 -0800


xudong963 commented on PR #18488:
URL: https://github.com/apache/datafusion/pull/18488#issuecomment-3490170648


   > Thank you, I think the implementation is correct.
   > 
   > The only consideration is performance. The Hash Join implementation is 
definitely on the performance-critical path, so we have to be careful not to 
introduce additional overhead. This PR should be good to go if we can verify 
that it has no impact on performance.
   > 
   > In this PR, the extra overhead is that, for each batch, we compute the 
`distinct_count` of a sorted vector like [0,1,1,2,2,2...] of up to batch-size 
length, which seems unlikely to be the bottleneck. (@alamb Could you help 
trigger the benchmark please?)
   > 
   > I believe these metrics provide more insight than simply computing 
`output_rows / input_rows` for equi-joins. However, if they introduce 
noticeable overhead, we can move them under `ExplainAnalyzeLevel::Dev`, and 
track them only when this extra-verbose level is enabled. We should also 
document that more detailed analyze levels may incur additional execution 
overhead but offer deeper insights.
   
   +1. It would be better to profile the classic join patterns first, so we 
get a clear picture.
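
   For reference, the per-batch work described above amounts to a single 
linear pass over the sorted indices. The following is a minimal sketch of that 
idea, not the PR's actual code: the function name `distinct_count_sorted` and 
the `u64` index type are assumptions made for illustration.

```rust
/// Counts the distinct values in a slice that is already sorted in
/// non-decreasing order, e.g. [0, 1, 1, 2, 2, 2].
///
/// Hypothetical helper for illustration; the name and the `u64` element
/// type are assumptions, not taken from the DataFusion code base.
fn distinct_count_sorted(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    // In a sorted slice, every adjacent pair with different values marks the
    // start of a new run, so distinct = 1 + number of such boundaries.
    1 + indices.windows(2).filter(|w| w[0] != w[1]).count()
}
```

   For `[0, 1, 1, 2, 2, 2]` this returns 3. The pass is O(n) in the batch 
length and allocates nothing, which is consistent with the expectation that it 
should not be the bottleneck, though only a benchmark can confirm that.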


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

