Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-15 Thread via GitHub


jonathanc-n closed pull request #16210: feat: Support null aware + equijoins 
for `NestedLoopJoin`
URL: https://github.com/apache/datafusion/pull/16210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-14 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2147410263


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > > If null_equal_null is true, we can replace = in the ON clause with <=> 
and merge it into the filter in the planner
   > 
   > This seems slower, we should just pass this to smj or hj. I can play 
around with the threshold logic that duckdb has for smaller tables where nljs 
may be faster.
   > 
   > >
   
   Under the premise that we prefer NLJ
   
   Rest lgtm
   
   null_equals_null equals true  + <=> condition -> we can always pass the <=> 
expressions into the on clause
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144282145


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Ok this sounds like a good plan, I think I will try to reiterate and run 
some benchmarks to see if it is worth it. 
   
   > If null_equal_null is true, we can replace = in the ON clause with <=> and 
merge it into the filter in the planner
   This seems slower, we should just pass this to smj or hj. I can play around 
with the threshold logic that duckdb has for smaller tables where nljs may be 
faster. 
   
   > There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner 
choose nlj
   This is interesting because how will sort merge join + hash join when to 
treat . And since hash join and smj are usually faster for this equi 
conditions, instead of passing in <=> into the join filter we can flip the null 
equals null bit to true and run the <=> in the `on` clause; but this would 
clash with a `=` if null_equals_null is false. The solution I was thinking of 
is these cases:
   - `null_equals_null` equals true + `<=>` condition +  `=` condition -> we 
pass the `<=>` expressions into the `on` clause
   - `null_equals_null` equals false +  `<=>` condition +  `=` condition -> 
`<=>` stays in join filter 
   - `null_equals_null` equals false + `<=>` condition + no `=` condition -> we 
flip the `null_equals_null` to true and pass the `<=>` expressions into the 
`on` clause
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144282145


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > If null_equal_null is true, we can replace = in the ON clause with <=> and 
merge it into the filter in the planner
   
   This seems slower, we should just pass this to smj or hj. I can play around 
with the threshold logic that duckdb has for smaller tables where nljs may be 
faster. 
   
   > There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner 
choose nlj
   
   Since hash join and smj are usually faster for this equi conditions, instead 
of passing in <=> into the join filter we can flip the null equals null bit to 
true and run the <=> in the `on` clause; but this would clash with a `=` if 
null_equals_null is false. The solution I was thinking of is these cases:
   - `null_equals_null` equals true + `<=>` condition +  `=` condition -> we 
pass the `<=>` expressions into the `on` clause
   - `null_equals_null` equals false +  `<=>` condition +  `=` condition -> 
`<=>` stays in join filter 
   - `null_equals_null` equals false + `<=>` condition + no `=` condition -> we 
flip the `null_equals_null` to true and pass the `<=>` expressions into the 
`on` clause
   
   WDYT @UBarney 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   I missed that we may **convert EXCEPT to a null aware join**. Sorry for that.
   If null_equal_null is true, we can replace `=` in the ON clause with `<=>` 
and merge it into the filter in the planner. Otherwise, we merge the ON clause 
into the filter directly.
   
   > how were you able to have the query use a nested loop join with a on 
condition?
   
   There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner 
choose nlj
   
   
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   I missed that we may convert EXCEPT to a null aware join. Sorry for that.
   If null_equal_null is true, we can replace `=` in the ON clause with `<=>` 
and merge it into the filter in the planner. Otherwise, we merge the ON clause 
into the filter directly.
   
   > how were you able to have the query use a nested loop join with a on 
condition?
   
   There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner 
choose nlj
   
   
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   I missed that we may convert EXCEPT to a null aware join. Sorry for that.
   If null_equal_null is true, we can replace `=` in the ON clause with `<=>` 
and merge it into the filter in the planner. Otherwise, we merge the ON clause 
into the filter directly.
   
   > how were you able to have the query use a nested loop join with a on 
condition?
   
   There's **no equal(`=`) join condition in `on`,** so the planner choose nlj
   
   
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   I missed that we may convert EXCEPT to a null aware join. Sorry for that.
   If null_equal_null is true, we can replace `=` in the ON clause with `<=>` 
and merge it into the filter in the planner. Otherwise, we merge the ON clause 
into the filter directly.
   
   > how were you able to have the query use a nested loop join with a on 
condition?
   
   There's **no equal join condition in `on`,** so the planner choose nlj
   
   
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144169706


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   I am curious though, how were you able to have the query use a nested loop 
join with a `on` condition?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144167426


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Oh yeah that makes sense, just have it be merged in the join filter. 
:facepalm: Thanks for finding this πŸ˜„. I think I'll just put out a pull request 
removing the todo statement, that statement led me to think that null aware 
wasn't supported yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144121144


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Let's just skip `ExtractEquijoinPredicate` if we prefer NLJ.
   
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
   
   
   I assume we achieve the behavior of `null_equals_null` by using `<=>`. For 
example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=> 
t2.value;`. The current spaceship operator can already handle this.
   
   
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144133973


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Looks like current nlj already supports null_equal_null πŸ˜‚
   
   
   
   ```
   > explain SELECT
   *
   FROM
   (VALUES (1, 'Apple'), (2, 'Banana'), (NULL, 'Cherry')) AS i(id, name)
   JOIN
   (VALUES (1, 'Fruit'), (3, 'Vegetable'), (NULL, 'Unknown')) AS c(id, 
category)
   ON
   i.id <=> c.id;
   
   
+---++
   | plan_type | plan   
|
   
+---++
   | physical_plan | β”Œβ”€β”€β”€β”  
|
   |   | β”‚ NestedLoopJoinExecβ”œβ”€β”€β”   
|
   |   | β””β”€β”¬β”€β”˜  β”‚   
|
   |   | β”Œβ”€β”΄β”€β”β”Œβ”€β”΄β”€β” 
|
   |   | β”‚   ProjectionExec  β”‚β”‚   ProjectionExec  β”‚ 
|
   |   | β”‚   β”‚β”‚   β”‚ 
|
   |   | β”‚id: column1β”‚β”‚ category: column2 β”‚ 
|
   |   | β”‚   name: column2   β”‚β”‚id: column1β”‚ 
|
   |   | β””β”€β”¬β”€β”˜β””β”€β”¬β”€β”˜ 
|
   |   | β”Œβ”€β”΄β”€β”β”Œβ”€β”΄β”€β” 
|
   |   | β”‚   DataSourceExec  β”‚β”‚   DataSourceExec  β”‚ 
|
   |   | β”‚   β”‚β”‚   β”‚ 
|
   |   | β”‚bytes: 1400β”‚β”‚bytes: 1400β”‚ 
|
   |   | β”‚   format: memory  β”‚β”‚   format: memory  β”‚ 
|
   |   | β”‚  rows: 1  β”‚β”‚  rows: 1  β”‚ 
|
   |   | β””β”€β”€β”€β”˜β””β”€β”€β”€β”˜ 
|
   |   |
|
   
+---++
   1 row(s) fetched.
   Elapsed 0.001 seconds.
   
   
   > SELECT
   *
   FROM
   (VALUES (1, 'Apple'), (2, 'Banana'), (NULL, 'Cherry')) AS i(id, name)
   JOIN
   (VALUES (1, 'Fruit'), (3, 'Vegetable'), (NULL, 'Unknown')) AS c(id, 
category)
   ON
   i.id <=> c.id;
   
   +--++--+--+
   | id   | name   | id   | category |
   +--++--+--+
   | 1| Apple  | 1| Fruit|
   | NULL | Cherry | NULL | Unknown  |
   +--++--+--+
   2 row(s) fetched.
   Elapsed 0.001 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144121144


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Let's just skip ExtractEquijoinPredicate if we prefer NLJ.
   
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
   
   
   I assume we can achieve the behavior of `null_equals_null` by using `<=>`. 
For example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=> 
t2.value;`. The current spaceship operator can already handle this.
   
   
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
   



##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Let's just skip `ExtractEquijoinPredicate` if we prefer NLJ.
   
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
   
   
   I assume we can achieve the behavior of `null_equals_null` by using `<=>`. 
For example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=> 
t2.value;`. The current spaceship operator can already handle this.
   
   
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2143227797


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   But how would you suggest doing the null_equal_null check. I do think that 
it willl probably be faster to merge the two.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2142968386


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > to make sure that we can do the null_equals_null check
   
   We can also do this by merging on condition into filter.
   For example: filter: `t1.a < t2.a, on: t1.c = t2.c` -> `filter:  t1.a < t2.a 
and (t1.c = t2.c or (is_null(t1.c) and (is_null(t2.c`
   
   > this is how it is implemented for the other joins as well
   
   Hash Joins and Sort-Merge Joins are fundamentally reliant on equality 
conditions. Their high-speed mechanisms (hashing, merging) are impossible 
without them.
   But for nlj separating equality conditions doesn't improve performance



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-12 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2142968386


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > we can do the null_equals_null check
   
   We can do this by merging on condition into filter.
   For example: filter: `t1.a < t2.a, on: t1.c = t2.c` -> `filter:  t1.a < t2.a 
and (t1.c = t2.c or (is_null(t1.c) and (is_null(t2.c`
   
   > this is how it is implemented for the other joins as well
   
   Hash Joins and Sort-Merge Joins are fundamentally reliant on equality 
conditions. Their high-speed mechanisms (hashing, merging) are impossible 
without them.
   But for nlj separating equality conditions doesn't improve performance



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-11 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2141190301


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . 
`on.is_empty() == true` mean no equijoin columns.
   
   Because nested loop joins exclusively take in filter clauses more often than 
not, I decided it'd be easier to check if `on.is_some()`. Currently the 
physical planner doesnt take in nested loop join for equijoins on purpose, this 
will be added in another pr.
   
   > If we can merge `on` condition into `filter`, we can remove this field and 
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c = 
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
   I separated merging the filter + on condition to make sure that we can do 
the null_equals_null check (this is how it is implemented for the other joins 
as well).



##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . 
`on.is_empty() == true` mean no equijoin columns.
   
   Because nested loop joins exclusively take in filter clauses more often than 
not, I decided it'd be easier to check if `on.is_some()`. Currently the 
physical planner doesnt take in nested loop join for equijoins on purpose, this 
will be added in another pr.
   
   > If we can merge `on` condition into `filter`, we can remove this field and 
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c = 
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
   
   I separated merging the filter + on condition to make sure that we can do 
the null_equals_null check (this is how it is implemented for the other joins 
as well).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-11 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2141190301


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   > Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . 
`on.is_empty() == true` mean no equijoin columns.
   Because nested loop joins exclusively take in filter clauses more often than 
not, I decided it'd be easier to check if `on.is_some()`. Currently the 
physical planner doesnt take in nested loop join for equijoins on purpose, this 
will be added in another pr.
   
   > If we can merge `on` condition into `filter`, we can remove this field and 
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c = 
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
   I separated merging the filter + on condition to make sure that we can do 
the null_equals_null check (this is how it is implemented for the other joins 
as well).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-11 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2140090439


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . 
`on.is_empty() == true` mean no equijoin columns.
   



##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   If we can merge `on` condition into `filter`, we can remove this field and 
reuse existing filter logic.
   For example: `filter: t1.a < t2.a, on: t1.c = t2.c` -> `filter:  t1.a < t2.a 
and t1.c = t2.c`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-11 Thread via GitHub


UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2140108914


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   If we can merge `on` condition into `filter`, we can remove this field and 
reuse current existing filter logic.
   For example: `filter: t1.a < t2.a, on: t1.c = t2.c` -> `filter:  t1.a < t2.a 
and t1.c = t2.c`



##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
 metrics: ExecutionPlanMetricsSet,
 /// Cache holding plan properties like equivalences, output partitioning 
etc.
 cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,

Review Comment:
   Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . `if 
on.is_empty()` mean no equijoin columns.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub


jonathanc-n closed pull request #16210: feat: Support null aware + equijoins 
for `NestedLoopJoin`
URL: https://github.com/apache/datafusion/pull/16210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub


jonathanc-n commented on PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2954336669

   @comphead Pointers have been resolved, good for another review. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-05 Thread via GitHub


jonathanc-n commented on PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2945916745

   There's an interesting implementation of index joins 
[here](https://github.com/duckdb/duckdb/pull/1008), but this would involve 
creating indexes. What are the thoughts on supporting `CREATE INDEX`?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-05 Thread via GitHub


jonathanc-n commented on PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2945598547

   @2010YOUY01 I was doing some benchmarks on NLJs vs. HJs and it looked bad 
even for cases where one table is very small which is what NLJs should excel 
at. One thing is to note that it does seem slightly faster when both tables are 
very small, especially in the case when the outer table is the larger table 
(makes sense due to hashing the build side, I should probably test this with 
SMJ cases). Duckdb has something interesting where if one of the tables have a 
certain row threshold (5), they choose to use a nested loop join for 
`on.is_some()` cases: 
[threshold](https://github.com/duckdb/duckdb/blob/a8a377580cc5d26ae90f16b41affad24ac6ee833/src/include/duckdb/main/client_config.hpp#L102)
 (we could probably do this as well when performance is better).
   
   ### Benchmarks
   Here is some experiment benchmarks that I had run if curious (one thing to 
note is that these benchmarks are just simulated on created row batches and are 
not read from disk + they are called directly from physical plan and not passed 
into the optimizer -- the batch sides would have been swapped): 
   
 Benchmark
 Benchmarking joins/HashJoin/l=5_r=10165536: Collecting 100 samples in 
estimated 7.3643 s (300 ite
   joins/HashJoin/l=5_r=10165536
   time:   [24.433 ms 24.499 ms 24.567 ms]
   change: [-26.227% -21.563% -17.138%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
 5 (5.00%) high mild
   Benchmarking joins/NestedLoopJoin/l=5_r=10165536: Warming up for 3. s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 40.0s, or reduce sample count to 10.
   Benchmarking joins/NestedLoopJoin/l=5_r=10165536: Collecting 100 samples in 
estimated 40.016 s (1
   joins/NestedLoopJoin/l=5_r=10165536
   time:   [399.05 ms 400.87 ms 403.07 ms]
   change: [+0.2540% +0.7221% +1.3508%] (p = 0.00 < 
0.05)
   Change within noise threshold.
   Found 10 outliers among 100 measurements (10.00%)
 3 (3.00%) high mild
 7 (7.00%) high severe
   Benchmarking joins/HashJoin/l=5_r=1165536: Collecting 100 samples in 
estimated 5.0385 s (2000 ite
   joins/HashJoin/l=5_r=1165536
   time:   [2.4954 ms 2.5147 ms 2.5488 ms]
   change: [+0.3564% +1.1539% +2.5753%] (p = 0.03 < 
0.05)
   Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
 2 (2.00%) high mild
 4 (4.00%) high severe
   Benchmarking joins/NestedLoopJoin/l=5_r=1165536: Collecting 100 samples in 
estimated 8.8061 s (20
   joins/NestedLoopJoin/l=5_r=1165536
   time:   [43.398 ms 43.540 ms 43.721 ms]
   change: [-3.4531% -2.3846% -1.4403%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
 4 (4.00%) high mild
 4 (4.00%) high severe
   Benchmarking joins/HashJoin/l=5_r=536: Collecting 100 samples in estimated 
5.0472 s (510k iterati
   joins/HashJoin/l=5_r=536
   time:   [9.9513 Β΅s 10.019 Β΅s 10.090 Β΅s]
   change: [-0.8239% -0.0294% +0.8130%] (p = 0.95 > 
0.05)
   No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
 1 (1.00%) high mild
 1 (1.00%) high severe
   Benchmarking joins/NestedLoopJoin/l=5_r=536: Collecting 100 samples in 
estimated 5.0749 s (202k i
   joins/NestedLoopJoin/l=5_r=536
   time:   [25.074 Β΅s 25.114 Β΅s 25.160 Β΅s]
   change: [-79.602% -79.115% -78.699%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
 1 (1.00%) high mild
 3 (3.00%) high severe
   Benchmarking joins/HashJoin/l=5_r=36: Collecting 100 samples in estimated 
5.0396 s (556k iteratio
   joins/HashJoin/l=5_r=36 time:   [9.1466 Β΅s 9.2176 Β΅s 9.2864 Β΅s]
   change: [+0.5504% +2.4160% +5.0403%] (p = 0.02 < 
0.05)
   Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
 1 (1.00%) high mild
 2 (2.00%) high severe
   Benchmarking joins/NestedLoopJoin/l=5_r=36: Collecting 100 samples in 
estimated 5.0394 s (641k it
   joins/NestedLoopJoin/l=5_r=36
   time:   [7.7842 Β΅s 7.8208 Β΅s 7.8561 Β΅s]
   change: [-92.509% -92.452% -92.375%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
 1 (1.00%) low mild
 2 (2.00%) high mild
 1 (1.00%) high severe
   Benchmarking joins/HashJoin/l=1

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-03 Thread via GitHub


comphead commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2125246086


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -1129,6 +1316,95 @@ pub(crate) mod tests {
 )
 }
 
+fn build_left_table_equijoin() -> Arc {
+build_table(
+("a1", &vec![12, 2, 11]),
+("b1", &vec![2, 2, 8]),
+("c1", &vec![50, 90, 110]),
+None,
+Vec::new(),
+)
+}
+
+fn build_right_table_equijoin() -> Arc {
+build_table(
+("a2", &vec![12, 2, 10]),
+("b2", &vec![10, 2, 10]),
+("c2", &vec![40, 80, 100]),
+None,
+Vec::new(),
+)
+}
+
+/// This builds table with nullable values (uses same logic as 
`build_table` above)
+fn build_table_null(
+a: (&str, &Vec>),
+b: (&str, &Vec>),
+c: (&str, &Vec>),
+batch_size: Option,
+sorted_column_names: Vec<&str>,
+) -> Arc {
+// build 0 or more sliced RecordBatches with nullable values
+let batch = build_table_i32_null(a, b, c);
+let schema = batch.schema();
+
+let batches = if let Some(batch_size) = batch_size {
+let num_batches = batch.num_rows().div_ceil(batch_size);
+(0..num_batches)
+.map(|i| {
+let start = i * batch_size;
+let remaining_rows = batch.num_rows() - start;
+batch.slice(start, batch_size.min(remaining_rows))
+})
+.collect::>()
+} else {
+vec![batch]
+};
+
+// create the in‑memory source
+let mut source =
+TestMemoryExec::try_new(&[batches], Arc::clone(&schema), 
None).unwrap();
+
+// attach sort information if requested
+if !sorted_column_names.is_empty() {
+let mut sort_info = LexOrdering::default();
+for name in sorted_column_names {
+let idx = schema.index_of(name).unwrap();
+sort_info.push(PhysicalSortExpr {
+expr: Arc::new(Column::new(name, idx)),
+options: SortOptions {
+descending: false,
+nulls_first: false,
+},
+});
+}
+source = 
source.try_with_sort_information(vec![sort_info]).unwrap();
+}
+
+// wrap in cache and return
+Arc::new(TestMemoryExec::update_cache(Arc::new(source)))
+}
+
+fn build_left_table_null() -> Arc {

Review Comment:
   a nit: but I think it would be easier to show the table contents in the 
content. 
   The table can be built in text format like here 
https://www.tablesgenerator.com/text_tables



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-03 Thread via GitHub


comphead commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2125235875


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -810,6 +871,123 @@ fn build_join_indices(
 }
 }
 
+// Find matching indices based on join `on` predicates
+fn get_equijoin_match(
+left_indices: UInt64Array,
+right_indices: UInt32Array,
+left_batch: &RecordBatch,
+right_batch: &RecordBatch,
+on: &Vec<(PhysicalExprRef, PhysicalExprRef)>,
+null_equals_null: bool,
+) -> Result<(UInt64Array, UInt32Array)> {
+// Create the different `ArrayRef`s holding the values that were evaluated
+// against each expression in the `on` predicate
+let left_arrays: Vec = on
+.iter()
+.map(|(l, _)| l.evaluate(left_batch))
+.collect::>>()?
+.into_iter()
+.map(|cv| cv.into_array(left_batch.num_rows()).unwrap())
+.collect();
+let right_arrays: Vec = on
+.iter()
+.map(|(_, r)| r.evaluate(right_batch))
+.collect::>>()?
+.into_iter()
+.map(|cv| cv.into_array(right_batch.num_rows()).unwrap())
+.collect();
+
+let mut out_l = UInt64Builder::new();
+let mut out_r = UInt32Builder::new();
+
+// Goes through both left and right indices and compares the values
+for (l_idx, r_idx) in 
left_indices.values().iter().zip(right_indices.values()) {
+if compare_arrays(
+&left_arrays,
+*l_idx as usize,
+&right_arrays,
+*r_idx as usize,
+null_equals_null,
+)? {
+out_l.append_value(*l_idx);
+out_r.append_value(*r_idx);
+}
+}
+
+Ok((out_l.finish(), out_r.finish()))
+}
+
+// Compares values in the array, returns true if match, false otherwise
+fn compare_arrays(
+left_arrays: &[ArrayRef],
+left: usize,

Review Comment:
   this reminds me of SMJ, the similar mechanism, probably we can factor it out?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub


2010YOUY01 commented on PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2924177927

   This feature is intended purely for performance, right? Not because there 
are cases that can only be run with NLJ.
   If that's the case, I suggest to do some benchmarks upfront to justify the 
additional complexity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub


jonathanc-n commented on PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2922765656

   I believe a follow up pull request can be made to actually have this be used 
in the physical plan based on some statistics. @comphead Does this look fine 
for now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-29 Thread via GitHub


jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2114981517


##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -810,6 +871,123 @@ fn build_join_indices(
 }
 }
 
+// Find matching indices based on join `on` predicates

Review Comment:
   Not sure if this is the most efficient way? I tried to add comments wherever 
possible to make the review process easier, and the logic is straight forward 
(should be simple to review πŸ˜„ ). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]