Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n closed pull request #16210: feat: Support null aware + equijoins for `NestedLoopJoin` URL: https://github.com/apache/datafusion/pull/16210 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2147410263
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> > If null_equal_null is true, we can replace = in the ON clause with <=>
and merge it into the filter in the planner
>
> This seems slower, we should just pass this to smj or hj. I can play
around with the threshold logic that duckdb has for smaller tables where nljs
may be faster.
>
> >
Under the premise that we prefer NLJ
Rest lgtm
null_equals_null equals true + <=> condition -> we can always pass the <=>
expressions into the on clause
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144282145
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Ok this sounds like a good plan, I think I will try to reiterate and run
some benchmarks to see if it is worth it.
> If null_equal_null is true, we can replace = in the ON clause with <=> and
merge it into the filter in the planner
This seems slower, we should just pass this to smj or hj. I can play around
with the threshold logic that duckdb has for smaller tables where nljs may be
faster.
> There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner
choose nlj
This is interesting because how will sort merge join + hash join when to
treat . And since hash join and smj are usually faster for this equi
conditions, instead of passing in <=> into the join filter we can flip the null
equals null bit to true and run the <=> in the `on` clause; but this would
clash with a `=` if null_equals_null is false. The solution I was thinking of
is these cases:
- `null_equals_null` equals true + `<=>` condition + `=` condition -> we
pass the `<=>` expressions into the `on` clause
- `null_equals_null` equals false + `<=>` condition + `=` condition ->
`<=>` stays in join filter
- `null_equals_null` equals false + `<=>` condition + no `=` condition -> we
flip the `null_equals_null` to true and pass the `<=>` expressions into the
`on` clause
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144282145
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> If null_equal_null is true, we can replace = in the ON clause with <=> and
merge it into the filter in the planner
This seems slower, we should just pass this to smj or hj. I can play around
with the threshold logic that duckdb has for smaller tables where nljs may be
faster.
> There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner
choose nlj
Since hash join and smj are usually faster for this equi conditions, instead
of passing in <=> into the join filter we can flip the null equals null bit to
true and run the <=> in the `on` clause; but this would clash with a `=` if
null_equals_null is false. The solution I was thinking of is these cases:
- `null_equals_null` equals true + `<=>` condition + `=` condition -> we
pass the `<=>` expressions into the `on` clause
- `null_equals_null` equals false + `<=>` condition + `=` condition ->
`<=>` stays in join filter
- `null_equals_null` equals false + `<=>` condition + no `=` condition -> we
flip the `null_equals_null` to true and pass the `<=>` expressions into the
`on` clause
WDYT @UBarney
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
I missed that we may **convert EXCEPT to a null aware join**. Sorry for that.
If null_equal_null is true, we can replace `=` in the ON clause with `<=>`
and merge it into the filter in the planner. Otherwise, we merge the ON clause
into the filter directly.
> how were you able to have the query use a nested loop join with a on
condition?
There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner
choose nlj
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
I missed that we may convert EXCEPT to a null aware join. Sorry for that.
If null_equal_null is true, we can replace `=` in the ON clause with `<=>`
and merge it into the filter in the planner. Otherwise, we merge the ON clause
into the filter directly.
> how were you able to have the query use a nested loop join with a on
condition?
There's **no equal(`=` not `<=>`) join condition in `on`,** so the planner
choose nlj
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
I missed that we may convert EXCEPT to a null aware join. Sorry for that.
If null_equal_null is true, we can replace `=` in the ON clause with `<=>`
and merge it into the filter in the planner. Otherwise, we merge the ON clause
into the filter directly.
> how were you able to have the query use a nested loop join with a on
condition?
There's **no equal(`=`) join condition in `on`,** so the planner choose nlj
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144201354
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
I missed that we may convert EXCEPT to a null aware join. Sorry for that.
If null_equal_null is true, we can replace `=` in the ON clause with `<=>`
and merge it into the filter in the planner. Otherwise, we merge the ON clause
into the filter directly.
> how were you able to have the query use a nested loop join with a on
condition?
There's **no equal join condition in `on`,** so the planner choose nlj
https://github.com/apache/datafusion/blob/e5f596b59ece71329635d41fd7f147d1d702e49e/datafusion/core/src/physical_planner.rs#L1104-L
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144169706
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
I am curious though, how were you able to have the query use a nested loop
join with a `on` condition?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144167426
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Oh yeah that makes sense, just have it be merged in the join filter.
:facepalm: Thanks for finding this π. I think I'll just put out a pull request
removing the todo statement, that statement led me to think that null aware
wasn't supported yet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144121144
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Let's just skip `ExtractEquijoinPredicate` if we prefer NLJ.
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
I assume we achieve the behavior of `null_equals_null` by using `<=>`. For
example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=>
t2.value;`. The current spaceship operator can already handle this.
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144133973
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Looks like current nlj already supports null_equal_null π
```
> explain SELECT
*
FROM
(VALUES (1, 'Apple'), (2, 'Banana'), (NULL, 'Cherry')) AS i(id, name)
JOIN
(VALUES (1, 'Fruit'), (3, 'Vegetable'), (NULL, 'Unknown')) AS c(id,
category)
ON
i.id <=> c.id;
+---++
| plan_type | plan
|
+---++
| physical_plan | βββββ
|
| | β NestedLoopJoinExecββββ
|
| | βββ¬ββ β
|
| | βββ΄βββββ΄ββ
|
| | β ProjectionExec ββ ProjectionExec β
|
| | β ββ β
|
| | βid: column1ββ category: column2 β
|
| | β name: column2 ββid: column1β
|
| | βββ¬βββββ¬ββ
|
| | βββ΄βββββ΄ββ
|
| | β DataSourceExec ββ DataSourceExec β
|
| | β ββ β
|
| | βbytes: 1400ββbytes: 1400β
|
| | β format: memory ββ format: memory β
|
| | β rows: 1 ββ rows: 1 β
|
| | ββββββββββ
|
| |
|
+---++
1 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT
*
FROM
(VALUES (1, 'Apple'), (2, 'Banana'), (NULL, 'Cherry')) AS i(id, name)
JOIN
(VALUES (1, 'Fruit'), (3, 'Vegetable'), (NULL, 'Unknown')) AS c(id,
category)
ON
i.id <=> c.id;
+--++--+--+
| id | name | id | category |
+--++--+--+
| 1| Apple | 1| Fruit|
| NULL | Cherry | NULL | Unknown |
+--++--+--+
2 row(s) fetched.
Elapsed 0.001 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2144121144
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Let's just skip ExtractEquijoinPredicate if we prefer NLJ.
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
I assume we can achieve the behavior of `null_equals_null` by using `<=>`.
For example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=>
t2.value;`. The current spaceship operator can already handle this.
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Let's just skip `ExtractEquijoinPredicate` if we prefer NLJ.
https://github.com/apache/datafusion/blob/8d79bddeda71bc415ba9bf85e77476ab1784c665/datafusion/optimizer/src/extract_equijoin_predicate.rs#L30-L43
I assume we can achieve the behavior of `null_equals_null` by using `<=>`.
For example: `select t1.value from range(6) t1 join range(6) t2 on t1.value <=>
t2.value;`. The current spaceship operator can already handle this.
https://github.com/apache/datafusion/blob/916ccc39f2af1aa3fb01f1c7e406fbcdbc03cc13/datafusion/sqllogictest/test_files/expr.slt#L1595-L1612
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2143227797
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
But how would you suggest doing the null_equal_null check. I do think that
it willl probably be faster to merge the two.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2142968386
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> to make sure that we can do the null_equals_null check
We can also do this by merging on condition into filter.
For example: filter: `t1.a < t2.a, on: t1.c = t2.c` -> `filter: t1.a < t2.a
and (t1.c = t2.c or (is_null(t1.c) and (is_null(t2.c`
> this is how it is implemented for the other joins as well
Hash Joins and Sort-Merge Joins are fundamentally reliant on equality
conditions. Their high-speed mechanisms (hashing, merging) are impossible
without them.
But for nlj separating equality conditions doesn't improve performance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2142968386
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> we can do the null_equals_null check
We can do this by merging on condition into filter.
For example: filter: `t1.a < t2.a, on: t1.c = t2.c` -> `filter: t1.a < t2.a
and (t1.c = t2.c or (is_null(t1.c) and (is_null(t2.c`
> this is how it is implemented for the other joins as well
Hash Joins and Sort-Merge Joins are fundamentally reliant on equality
conditions. Their high-speed mechanisms (hashing, merging) are impossible
without them.
But for nlj separating equality conditions doesn't improve performance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2141190301
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` .
`on.is_empty() == true` mean no equijoin columns.
Because nested loop joins exclusively take in filter clauses more often than
not, I decided it'd be easier to check if `on.is_some()`. Currently the
physical planner doesnt take in nested loop join for equijoins on purpose, this
will be added in another pr.
> If we can merge `on` condition into `filter`, we can remove this field and
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c =
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
I separated merging the filter + on condition to make sure that we can do
the null_equals_null check (this is how it is implemented for the other joins
as well).
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` .
`on.is_empty() == true` mean no equijoin columns.
Because nested loop joins exclusively take in filter clauses more often than
not, I decided it'd be easier to check if `on.is_some()`. Currently the
physical planner doesnt take in nested loop join for equijoins on purpose, this
will be added in another pr.
> If we can merge `on` condition into `filter`, we can remove this field and
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c =
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
I separated merging the filter + on condition to make sure that we can do
the null_equals_null check (this is how it is implemented for the other joins
as well).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2141190301
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
> Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` .
`on.is_empty() == true` mean no equijoin columns.
Because nested loop joins exclusively take in filter clauses more often than
not, I decided it'd be easier to check if `on.is_some()`. Currently the
physical planner doesnt take in nested loop join for equijoins on purpose, this
will be added in another pr.
> If we can merge `on` condition into `filter`, we can remove this field and
reuse existing filter logic. For example: `filter: t1.a < t2.a, on: t1.c =
t2.c` -> `filter: t1.a < t2.a and t1.c = t2.c`
I separated merging the filter + on condition to make sure that we can do
the null_equals_null check (this is how it is implemented for the other joins
as well).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2140090439
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` .
`on.is_empty() == true` mean no equijoin columns.
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
If we can merge `on` condition into `filter`, we can remove this field and
reuse existing filter logic.
For example: `filter: t1.a < t2.a, on: t1.c = t2.c` -> `filter: t1.a < t2.a
and t1.c = t2.c`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
UBarney commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2140108914
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
If we can merge `on` condition into `filter`, we can remove this field and
reuse current existing filter logic.
For example: `filter: t1.a < t2.a, on: t1.c = t2.c` -> `filter: t1.a < t2.a
and t1.c = t2.c`
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -178,6 +187,18 @@ pub struct NestedLoopJoinExec {
metrics: ExecutionPlanMetricsSet,
/// Cache holding plan properties like equivalences, output partitioning
etc.
cache: PlanProperties,
+/// Null matching behavior: If `null_equals_null` is true, rows that have
+/// `null`s in both left and right equijoin columns will be matched.
+/// Otherwise, rows that have `null`s in the join columns will not be
+/// matched and thus will not appear in the output.
+null_equals_null: bool,
+/// Set of equijoin columns from the relations: `(left_col, right_col)`
+///
+/// This is optional as a nested loop join can be passed a 'on' clause
+/// in the case that a Hash Join cost is more expensive than a
+/// nested loop join or when a user would like to pick nested loop
+/// join by hint
+on: Option>,
Review Comment:
Maybe we can change type to `Vec<(PhysicalExprRef, PhysicalExprRef)>` . `if
on.is_empty()` mean no equijoin columns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n closed pull request #16210: feat: Support null aware + equijoins for `NestedLoopJoin` URL: https://github.com/apache/datafusion/pull/16210 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2954336669 @comphead Pointers have been resolved, good for another review. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2945916745 There's an interesting implementation of index joins [here](https://github.com/duckdb/duckdb/pull/1008), but this would involve creating indexes. What are the thoughts on supporting `CREATE INDEX`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2945598547 @2010YOUY01 I was doing some benchmarks on NLJs vs. HJs and it looked bad even for cases where one table is very small which is what NLJs should excel at. One thing is to note that it does seem slightly faster when both tables are very small, especially in the case when the outer table is the larger table (makes sense due to hashing the build side, I should probably test this with SMJ cases). Duckdb has something interesting where if one of the tables have a certain row threshold (5), they choose to use a nested loop join for `on.is_some()` cases: [threshold](https://github.com/duckdb/duckdb/blob/a8a377580cc5d26ae90f16b41affad24ac6ee833/src/include/duckdb/main/client_config.hpp#L102) (we could probably do this as well when performance is better). ### Benchmarks Here is some experiment benchmarks that I had run if curious (one thing to note is that these benchmarks are just simulated on created row batches and are not read from disk + they are called directly from physical plan and not passed into the optimizer -- the batch sides would have been swapped): Benchmark Benchmarking joins/HashJoin/l=5_r=10165536: Collecting 100 samples in estimated 7.3643 s (300 ite joins/HashJoin/l=5_r=10165536 time: [24.433 ms 24.499 ms 24.567 ms] change: [-26.227% -21.563% -17.138%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild Benchmarking joins/NestedLoopJoin/l=5_r=10165536: Warming up for 3. s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 40.0s, or reduce sample count to 10. Benchmarking joins/NestedLoopJoin/l=5_r=10165536: Collecting 100 samples in estimated 40.016 s (1 joins/NestedLoopJoin/l=5_r=10165536 time: [399.05 ms 400.87 ms 403.07 ms] change: [+0.2540% +0.7221% +1.3508%] (p = 0.00 < 0.05) Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) high mild 7 (7.00%) high severe Benchmarking joins/HashJoin/l=5_r=1165536: Collecting 100 samples in estimated 5.0385 s (2000 ite joins/HashJoin/l=5_r=1165536 time: [2.4954 ms 2.5147 ms 2.5488 ms] change: [+0.3564% +1.1539% +2.5753%] (p = 0.03 < 0.05) Change within noise threshold. Found 6 outliers among 100 measurements (6.00%) 2 (2.00%) high mild 4 (4.00%) high severe Benchmarking joins/NestedLoopJoin/l=5_r=1165536: Collecting 100 samples in estimated 8.8061 s (20 joins/NestedLoopJoin/l=5_r=1165536 time: [43.398 ms 43.540 ms 43.721 ms] change: [-3.4531% -2.3846% -1.4403%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 4 (4.00%) high mild 4 (4.00%) high severe Benchmarking joins/HashJoin/l=5_r=536: Collecting 100 samples in estimated 5.0472 s (510k iterati joins/HashJoin/l=5_r=536 time: [9.9513 Β΅s 10.019 Β΅s 10.090 Β΅s] change: [-0.8239% -0.0294% +0.8130%] (p = 0.95 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe Benchmarking joins/NestedLoopJoin/l=5_r=536: Collecting 100 samples in estimated 5.0749 s (202k i joins/NestedLoopJoin/l=5_r=536 time: [25.074 Β΅s 25.114 Β΅s 25.160 Β΅s] change: [-79.602% -79.115% -78.699%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe Benchmarking joins/HashJoin/l=5_r=36: Collecting 100 samples in estimated 5.0396 s (556k iteratio joins/HashJoin/l=5_r=36 time: [9.1466 Β΅s 9.2176 Β΅s 9.2864 Β΅s] change: [+0.5504% +2.4160% +5.0403%] (p = 0.02 < 0.05) Change within noise threshold. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe Benchmarking joins/NestedLoopJoin/l=5_r=36: Collecting 100 samples in estimated 5.0394 s (641k it joins/NestedLoopJoin/l=5_r=36 time: [7.7842 Β΅s 7.8208 Β΅s 7.8561 Β΅s] change: [-92.509% -92.452% -92.375%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe Benchmarking joins/HashJoin/l=1
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
comphead commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2125246086
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -1129,6 +1316,95 @@ pub(crate) mod tests {
)
}
+fn build_left_table_equijoin() -> Arc {
+build_table(
+("a1", &vec![12, 2, 11]),
+("b1", &vec![2, 2, 8]),
+("c1", &vec![50, 90, 110]),
+None,
+Vec::new(),
+)
+}
+
+fn build_right_table_equijoin() -> Arc {
+build_table(
+("a2", &vec![12, 2, 10]),
+("b2", &vec![10, 2, 10]),
+("c2", &vec![40, 80, 100]),
+None,
+Vec::new(),
+)
+}
+
+/// This builds table with nullable values (uses same logic as
`build_table` above)
+fn build_table_null(
+a: (&str, &Vec>),
+b: (&str, &Vec>),
+c: (&str, &Vec>),
+batch_size: Option,
+sorted_column_names: Vec<&str>,
+) -> Arc {
+// build 0 or more sliced RecordBatches with nullable values
+let batch = build_table_i32_null(a, b, c);
+let schema = batch.schema();
+
+let batches = if let Some(batch_size) = batch_size {
+let num_batches = batch.num_rows().div_ceil(batch_size);
+(0..num_batches)
+.map(|i| {
+let start = i * batch_size;
+let remaining_rows = batch.num_rows() - start;
+batch.slice(start, batch_size.min(remaining_rows))
+})
+.collect::>()
+} else {
+vec![batch]
+};
+
+// create the inβmemory source
+let mut source =
+TestMemoryExec::try_new(&[batches], Arc::clone(&schema),
None).unwrap();
+
+// attach sort information if requested
+if !sorted_column_names.is_empty() {
+let mut sort_info = LexOrdering::default();
+for name in sorted_column_names {
+let idx = schema.index_of(name).unwrap();
+sort_info.push(PhysicalSortExpr {
+expr: Arc::new(Column::new(name, idx)),
+options: SortOptions {
+descending: false,
+nulls_first: false,
+},
+});
+}
+source =
source.try_with_sort_information(vec![sort_info]).unwrap();
+}
+
+// wrap in cache and return
+Arc::new(TestMemoryExec::update_cache(Arc::new(source)))
+}
+
+fn build_left_table_null() -> Arc {
Review Comment:
a nit: but I think it would be easier to show the table contents in the
content.
The table can be built in text format like here
https://www.tablesgenerator.com/text_tables
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
comphead commented on code in PR #16210:
URL: https://github.com/apache/datafusion/pull/16210#discussion_r2125235875
##
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##
@@ -810,6 +871,123 @@ fn build_join_indices(
}
}
+// Find matching indices based on join `on` predicates
+fn get_equijoin_match(
+left_indices: UInt64Array,
+right_indices: UInt32Array,
+left_batch: &RecordBatch,
+right_batch: &RecordBatch,
+on: &Vec<(PhysicalExprRef, PhysicalExprRef)>,
+null_equals_null: bool,
+) -> Result<(UInt64Array, UInt32Array)> {
+// Create the different `ArrayRef`s holding the values that were evaluated
+// against each expression in the `on` predicate
+let left_arrays: Vec = on
+.iter()
+.map(|(l, _)| l.evaluate(left_batch))
+.collect::>>()?
+.into_iter()
+.map(|cv| cv.into_array(left_batch.num_rows()).unwrap())
+.collect();
+let right_arrays: Vec = on
+.iter()
+.map(|(_, r)| r.evaluate(right_batch))
+.collect::>>()?
+.into_iter()
+.map(|cv| cv.into_array(right_batch.num_rows()).unwrap())
+.collect();
+
+let mut out_l = UInt64Builder::new();
+let mut out_r = UInt32Builder::new();
+
+// Goes through both left and right indices and compares the values
+for (l_idx, r_idx) in
left_indices.values().iter().zip(right_indices.values()) {
+if compare_arrays(
+&left_arrays,
+*l_idx as usize,
+&right_arrays,
+*r_idx as usize,
+null_equals_null,
+)? {
+out_l.append_value(*l_idx);
+out_r.append_value(*r_idx);
+}
+}
+
+Ok((out_l.finish(), out_r.finish()))
+}
+
+// Compares values in the array, returns true if match, false otherwise
+fn compare_arrays(
+left_arrays: &[ArrayRef],
+left: usize,
Review Comment:
this reminds me of SMJ, the similar mechanism, probably we can factor it out?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
2010YOUY01 commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2924177927 This feature is intended purely for performance, right? Not because there are cases that can only be run with NLJ. If that's the case, I suggest to do some benchmarks upfront to justify the additional complexity. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2922765656 I believe a follow up pull request can be made to actually have this be used in the physical plan based on some statistics. @comphead Does this look fine for now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]
jonathanc-n commented on code in PR #16210: URL: https://github.com/apache/datafusion/pull/16210#discussion_r2114981517 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -810,6 +871,123 @@ fn build_join_indices( } } +// Find matching indices based on join `on` predicates Review Comment: Not sure if this is the most efficient way? I tried to add comments wherever possible to make the review process easier, and the logic is straight forward (should be simple to review π ). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
