Re: [PR] Enhance: simplify x=x [datafusion]
ding-young commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2032229805
##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -2152,6 +2169,17 @@ mod tests {
}
}
+#[test]
+fn test_simplify_eq_not_self() {
+let expr_a = col("c2").eq(col("c2"));
+let expr_b = col("c2_non_null").eq(col("c2_non_null"));
+let expected_a = col("c2").is_not_null().or(lit_bool_null());
+let expected_b = lit(true);
Review Comment:
I updated the comments. Thank you :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
alamb commented on code in PR #15589: URL: https://github.com/apache/datafusion/pull/15589#discussion_r2036171313 ## datafusion/sqllogictest/test_files/array.slt: ## @@ -6140,21 +6140,19 @@ logical_plan 02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] 03)SubqueryAlias: test 04)--SubqueryAlias: t -05)Projection: -06)--Filter: __common_expr_3 = __common_expr_3 -07)Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3 -08)--TableScan: tmp_table projection=[value] +05)Projection: Review Comment: There is a comment a few lines above that says: > # TODO: this should probably be possible to completely remove the filter as always true? We can probably update that too -- but we could do it in a follow on PR as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
kosiew commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2030907522
##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -760,6 +760,23 @@ impl TreeNodeRewriter for Simplifier<'_,
S> {
None => lit_bool_null(),
})
}
+// A = A --> A IS NOT NULL OR NULL
+// A = A --> true (if A not nullable)
+Expr::BinaryExpr(BinaryExpr {
+left,
+op: Eq,
+right,
+}) if (left == right) & !left.is_volatile() => {
+Transformed::yes(match !info.nullable(&right)? {
+true => lit(true),
+false => Expr::BinaryExpr(BinaryExpr {
+left: Box::new(Expr::IsNotNull(left)),
+op: Or,
+right: Box::new(lit_bool_null()),
+}),
+})
+}
Review Comment:
🔐 Suggestion: Consider Using left Instead of right for nullable Check
While left == right ensures equality, semantically it might be clearer to
consistently use left for readability and convention. This avoids any reader
confusion on whether left or right is being evaluated for nullability.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
kosiew commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2030914487
##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -2152,6 +2169,17 @@ mod tests {
}
}
+#[test]
+fn test_simplify_eq_not_self() {
+let expr_a = col("c2").eq(col("c2"));
+let expr_b = col("c2_non_null").eq(col("c2_non_null"));
+let expected_a = col("c2").is_not_null().or(lit_bool_null());
+let expected_b = lit(true);
Review Comment:
```suggestion
// `expr_a`: column `c2` is nullable, so `c2 = c2` simplifies to `c2 IS
NOT NULL OR NULL`
// This ensures the expression is only true when `c2` is not NULL,
accounting for SQL's NULL semantics.
let expected_a = col("c2").is_not_null().or(lit_bool_null());
// `expr_b`: column `c2_non_null` is explicitly non-nullable, so
`c2_non_null = c2_non_null` is always true
let expected_b = lit(true);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
berkaysynnada commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2782457130 > I wrote a simple criterion bench (https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs) and the result is as follows. It showed performance improvement, especially for comparing long strings. Thank you for sharing the results. It doesn't effect much here but next time you can exclude the create_context() part out of bench_function :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
ding-young commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2782347240 I wrote a simple criterion bench (https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs) and the result is as follows. It showed performance improvement, especially for comparing long strings. ``` select_f32_eq_f32 time: [366.19 µs 366.42 µs 366.69 µs] change: [-79.571% -79.180% -78.787%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 2 (2.00%) high mild 6 (6.00%) high severe select_str_eq_str time: [372.53 µs 373.02 µs 373.61 µs] change: [-85.333% -85.117% -84.884%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe select_str_eq_str_long time: [372.49 µs 372.85 µs 373.25 µs] change: [-98.351% -98.316% -98.280%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe ``` Plus, since `x IS NOT NULL OR NULL` is slightly better than `CASE`, I'll leave the pr as is :) ``` select_CASE_str_eq_str_long time: [396.36 µs 396.73 µs 397.15 µs] Found 10 outliers among 100 measurements (10.00%) 2 (2.00%) high mild 8 (8.00%) high severe ``` Thanks for the review! Trying out the bench was also a great experience for me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
alamb commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429451 We could also use a CASE ```sql CASE x IS NOT NULL THEN true ELSE null END ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
alamb commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429711 I expect this to make a large performance difference when x is a string type (as string comparisons are fairly expensive) Thank you for this PR @ding-young and the great reviews @2010YOUY01 and @berkaysynnada -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
berkaysynnada commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781400999 > About the performance, I'm not 100% sure whether this rule worth the change, but I made this change because `IS NOT NULL OR NULL` was a bit better than actual comparision. I recommend writing a simple benchmark by criterion or other tools to get more reliable results instead of measuring the CLI execution -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
ding-young commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781281225 @2010YOUY01 Instead of applying transformation on filter expression, I adjusted the rule to transform x=x into `x IS NOT NULL OR NULL`. This preserves the behavior where NULL = NULL evaluates to NULL, which is treated as FALSE in filter expressions. About the performance, I'm not 100% sure whether this rule worth the change, but I made this change because `IS NOT NULL OR NULL` was a bit better than actual comparision.  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
ding-young commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2780386375 > This optimizer transformation doesn't look right for projected columns like > > ``` > create table t1(v1 int); > insert into t1 values (null); > > select v1=v1 from t1; > -- Should return `NULL`, but get `False` > ``` > > However it makes sense if it is only applied on filter expressions. Thank you for review. I'll make the change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Enhance: simplify x=x [datafusion]
2010YOUY01 commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2780385095 This optimizer transformation doesn't look right for projected columns like ``` create table t1(v1 int); insert into t1 values (null); select v1=v1 from t1; -- Should return `NULL`, but get `False` ``` However it makes sense if it is only applied on filter expressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
