Re: [PR] feat: Add Hash Join benchmarks [datafusion]

via GitHub Sun, 21 Sep 2025 08:22:18 -0700


2010YOUY01 commented on code in PR #17636:
URL: https://github.com/apache/datafusion/pull/17636#discussion_r2366274431



##########
benchmarks/src/hj.rs:
##########
@@ -150,6 +149,20 @@ const HASH_QUERIES: &[&str] = &[
         FULL JOIN range(30000) AS t2
           ON (t1.value % 2) = (t2.value % 2)
     "#,
+    // Q13: INNER 30K x 30K | MEDIUM ~33% | double predicate
+    r#"
+        SELECT t1.value, t2.value
+        FROM range(30000) AS t1
+        INNER JOIN range(30000) AS t2
+          ON (t1.value = t2.value) AND (t1.value > 10000 and t2.value < 20000)

Review Comment:
   It seems `(t1.value > 10000 and t2.value < 20000)` will be pushed down below
   the join instead of being evaluated inside `HashJoinExec`, since each
   conjunct references only one side of the join.
   I think we can use
   `ON (t1.value = t2.value) AND ((t1.value+t2.value)%10 > 0)`
   here for high selectivity; that filter references both sides.
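
   As a rough sanity check on the two predicates, here is a minimal
   standalone sketch that counts how many equi-join matches on `range(30000)`
   survive each residual filter. It models only the row counts, not where
   DataFusion actually evaluates the filter, and the helper names
   (`count_matches`, `original_q13`, `suggested_q13`) are hypothetical and
   not part of `benchmarks/src/hj.rs`.

```rust
/// Counts rows of a self equi-join on `range(n)` (t1.value = t2.value)
/// that also pass a residual predicate over (t1.value, t2.value).
/// Hypothetical helper for illustration only.
fn count_matches(n: i64, residual: impl Fn(i64, i64) -> bool) -> usize {
    // With the equality key, every matched pair has t1.value == t2.value,
    // so it is enough to walk the key once and test the residual on (v, v).
    (0..n).filter(|&v| residual(v, v)).count()
}

/// Residual from the Q13 under review: two single-sided comparisons.
fn original_q13(n: i64) -> usize {
    count_matches(n, |t1, t2| t1 > 10000 && t2 < 20000)
}

/// Residual suggested in the review: an expression over both sides.
fn suggested_q13(n: i64) -> usize {
    count_matches(n, |t1, t2| (t1 + t2) % 10 > 0)
}

fn main() {
    let n = 30000;
    println!("original:  {} of {} matches kept", original_q13(n), n);
    println!("suggested: {} of {} matches kept", suggested_q13(n), n);
}
```

   Under these assumptions the original filter keeps 9999 of the 30000
   matches (about 33%, in line with the query comment), while the suggested
   one keeps 24000 (80%), so the `~33%` label would need updating if the
   predicate is changed.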



##########
benchmarks/src/hj.rs:
##########
@@ -58,10 +58,9 @@ const HASH_QUERIES: &[&str] = &[
     // equality on key + cheap filter to downselect
     r#"
         SELECT t1.value, t2.value
-        FROM range(10000) AS t1
+        FROM generate_series(0,10000, 1000) AS t1(value)

Review Comment:
   Thanks for the update. I think it's better to apply it to all the other queries as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

