LiaCastaneda commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2549183006
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -1471,6 +1502,29 @@ async fn collect_left_input(
// Convert Box to Arc for sharing with SharedBuildAccumulator
let hash_map: Arc<dyn JoinHashMapType> = hashmap.into();
+ let membership = if num_rows == 0 {
+ PushdownStrategy::Empty
+ } else {
+ // If the build side is small enough we can use IN list pushdown.
+ // If it's too big we fall back to pushing down a reference to the hash table.
+ // See `PushdownStrategy` for more details.
+ let estimated_size = left_values
+ .iter()
+ .map(|arr| arr.get_array_memory_size())
Review Comment:
Yeah, since `estimated_size` is used to estimate the CPU time spent building the
filter (rather than actual memory use), it makes sense to 'double account': in
theory it's roughly double the CPU work to build the filter, I guess.
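
To make the reasoning concrete, here is a minimal sketch of the selection logic as I read it. It is not the code in this PR: the enum is a data-free stand-in for `PushdownStrategy`, and `choose_strategy`, `column_sizes`, and `max_in_list_bytes` are hypothetical names used only for illustration. `column_sizes` stands in for the per-column `arr.get_array_memory_size()` values.

```rust
/// Simplified stand-in for the strategy enum in the diff.
/// The variants here carry no data; this is an assumption for illustration.
#[derive(Debug, PartialEq)]
enum PushdownStrategy {
    Empty,
    InList,
    HashTable,
}

/// Hypothetical helper: pick a strategy from per-column size estimates.
/// `max_in_list_bytes` is an assumed threshold, not a real config name.
fn choose_strategy(
    num_rows: usize,
    column_sizes: &[usize],
    max_in_list_bytes: usize,
) -> PushdownStrategy {
    if num_rows == 0 {
        return PushdownStrategy::Empty;
    }
    // Sum every column without deduplicating shared buffers: the estimate
    // is a proxy for the work of building the filter, and each column is
    // processed even if it shares memory with another one.
    let estimated_size: usize = column_sizes.iter().sum();
    if estimated_size <= max_in_list_bytes {
        PushdownStrategy::InList
    } else {
        PushdownStrategy::HashTable
    }
}
```

With an assumed 1024-byte threshold, a single 600-byte column stays on the IN list path, while two 600-byte columns (even if they share buffers) tip over to the hash table reference, which matches the 'double the work' intuition.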
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]