Re: [PR] Push down InList or hash table references from HashJoinExec depending on the size of the build side [datafusion]

via GitHub Thu, 20 Nov 2025 13:20:08 -0800


LiaCastaneda commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2547703677



##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -1471,6 +1502,29 @@ async fn collect_left_input(
     // Convert Box to Arc for sharing with SharedBuildAccumulator
     let hash_map: Arc<dyn JoinHashMapType> = hashmap.into();
 
+    let membership = if num_rows == 0 {
+        PushdownStrategy::Empty
+    } else {
+        // If the build side is small enough we can use IN list pushdown.
+        // If it's too big we fall back to pushing down a reference to the hash table.
+        // See `PushdownStrategy` for more details.
+        let estimated_size = left_values
+            .iter()
+            .map(|arr| arr.get_array_memory_size())

Review Comment:
   If we have a query like `SELECT * FROM t1 JOIN t2 ON t1.a = t2.x AND t1.a = 
t2.y`, `left_values` would contain `t1.a` twice (the same `ArrayRef`). Since both 
entries reference the same underlying data, `estimated_size` would double-count 
the memory. However, I guess this overcounting is acceptable because we are 
estimating CPU cost?
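
   To make the double counting concrete, here is a minimal, std-only sketch. It 
is not DataFusion or Arrow code: `ColumnRef`, `column_bytes`, `naive_size`, and 
`deduped_size` are hypothetical stand-ins, with `Arc<Vec<i64>>` playing the role 
of an `ArrayRef` and `column_bytes` playing the role of 
`get_array_memory_size()`. It only shows how summing per entry differs from 
summing per distinct allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Hypothetical stand-in for an Arrow `ArrayRef`: a shared, immutable column.
type ColumnRef = Arc<Vec<i64>>;

// Stand-in for `get_array_memory_size()`: bytes held by the column's values.
fn column_bytes(col: &ColumnRef) -> usize {
    col.len() * std::mem::size_of::<i64>()
}

// Mirrors the summation in the diff: every entry is counted, shared or not.
fn naive_size(cols: &[ColumnRef]) -> usize {
    cols.iter().map(column_bytes).sum()
}

// Counts each distinct allocation once, keyed by the pointer behind the Arc.
fn deduped_size(cols: &[ColumnRef]) -> usize {
    let mut seen = HashSet::new();
    cols.iter()
        .filter(|col| seen.insert(Arc::as_ptr(*col)))
        .map(column_bytes)
        .sum()
}
```

   With the same column passed twice, `naive_size` reports twice the bytes that 
`deduped_size` does. Whether deduplicating is worth it here depends on what the 
estimate is meant to capture, which is the question above.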



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

