LiaCastaneda opened a new issue, #20492:
URL: https://github.com/apache/datafusion/issues/20492

   ### Is your feature request related to a problem or challenge?
   
   When the `HashJoinExec` build side returns 0 rows, the probe side stream is 
still fully consumed even though no output will be produced (for join types 
like INNER, LEFT, LEFT SEMI, etc.). We've seen queries where the probe side 
scans more than 10 GB of data even though the build side returns no rows and 
`HashJoinExec` outputs 0 rows.
   
   The short-circuit at 
[stream.rs:647](https://github.com/apache/datafusion/blob/ace9cd44b7356d60e6d69d0b98ac3f5606d55507/datafusion/physical-plan/src/joins/hash_join/stream.rs#L647)
 skips hash lookup work, but `fetch_probe_batch` is still called for every 
batch until the stream is exhausted. The transition from `WaitBuildSide` --> 
`FetchProbeBatch` is unconditional: there is no check after the build phase 
completes to decide whether the probe side needs to be polled at all.
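
To make the behavior described above concrete, here is a toy model of the state transitions. It is only a sketch: the `State` enum and `probe_batches_polled` function are hypothetical stand-ins written for illustration, not DataFusion's actual types, and the probe stream is simulated with a plain range.

```rust
// Toy model of the stream states named in the issue; not DataFusion's real types.
#[derive(Debug, PartialEq)]
enum State {
    WaitBuildSide,
    FetchProbeBatch,
    Completed,
}

/// Drives the toy state machine and returns how many probe batches were polled.
/// `build_rows` is the number of rows collected from the build side.
fn probe_batches_polled(build_rows: usize, probe_batches: usize) -> usize {
    let mut probe = 0..probe_batches; // stand-in for the probe-side stream
    let mut state = State::WaitBuildSide;
    let mut polled = 0;
    while state != State::Completed {
        state = match state {
            // Unconditional transition: `build_rows` is not consulted here.
            State::WaitBuildSide => State::FetchProbeBatch,
            State::FetchProbeBatch => match probe.next() {
                Some(_batch) => {
                    polled += 1;
                    // With an empty build side the hash lookup is skipped,
                    // but the batch has already been fetched by this point.
                    let _lookup_skipped = build_rows == 0;
                    State::FetchProbeBatch
                }
                None => State::Completed,
            },
            State::Completed => State::Completed,
        };
    }
    polled
}
```

In this model the number of polled batches is the same whether `build_rows` is zero or not, which is the cost the issue is pointing at.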
   
   I'm not sure if this is intentional or if anything relies on the probe side 
being fully consumed in this scenario. If not, it seems like after 
`collect_build_side` completes, we could drop the probe stream immediately for 
the join types where an empty build side guarantees an empty output.
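
A minimal sketch of the check this would need, assuming the build side is the left input. `JoinKind`, `empty_build_implies_empty_output`, and `should_poll_probe` are hypothetical names used only for illustration; a real change would use DataFusion's own join type enum and would also need to cover any variants not listed here.

```rust
// Hypothetical local enum for illustration; the build side is assumed to be the left input.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    LeftAnti,
    RightSemi,
    RightAnti,
}

/// True when an empty build (left) side guarantees that the join emits no rows.
fn empty_build_implies_empty_output(kind: JoinKind) -> bool {
    matches!(
        kind,
        // These emit only left rows, or rows that need a match on the left.
        JoinKind::Inner
            | JoinKind::Left
            | JoinKind::LeftSemi
            | JoinKind::LeftAnti
            | JoinKind::RightSemi
    )
    // Right, Full, and RightAnti still emit unmatched probe-side rows,
    // so the probe side must be read for them.
}

/// Decision that could be taken once the build side has been collected.
fn should_poll_probe(build_rows: usize, kind: JoinKind) -> bool {
    build_rows > 0 || !empty_build_implies_empty_output(kind)
}
```

When `should_poll_probe` returns `false`, the stream could move straight to its completed state and drop the probe input, provided nothing else depends on the probe side being drained.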
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

