Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-10 Thread via GitHub


ding-young commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2032229805


##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -2152,6 +2169,17 @@ mod tests {
 }
 }
 
+#[test]
+fn test_simplify_eq_not_self() {
+let expr_a = col("c2").eq(col("c2"));
+let expr_b = col("c2_non_null").eq(col("c2_non_null"));
+let expected_a = col("c2").is_not_null().or(lit_bool_null());
+let expected_b = lit(true);

Review Comment:
   I updated the comments. Thank you :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-09 Thread via GitHub


alamb commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2036171313


##
datafusion/sqllogictest/test_files/array.slt:
##
@@ -6140,21 +6140,19 @@ logical_plan
 02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
 03)SubqueryAlias: test
 04)--SubqueryAlias: t
-05)Projection:
-06)--Filter: __common_expr_3 = __common_expr_3
-07)Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS 
Utf8), Int64(1), Int64(32)) AS __common_expr_3
-08)--TableScan: tmp_table projection=[value]
+05)Projection: 

Review Comment:
   There is a comment a few lines above that says:
   
   > # TODO: this should probably be possible to completely remove the filter 
as always true?
   
   We can probably update that too -- but we could do it in a follow on PR as 
well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-07 Thread via GitHub


kosiew commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2030907522


##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -760,6 +760,23 @@ impl TreeNodeRewriter for Simplifier<'_, 
S> {
 None => lit_bool_null(),
 })
 }
+// A = A --> A IS NOT NULL OR NULL
+// A = A --> true (if A not nullable)
+Expr::BinaryExpr(BinaryExpr {
+left,
+op: Eq,
+right,
+}) if (left == right) & !left.is_volatile() => {
+Transformed::yes(match !info.nullable(&right)? {
+true => lit(true),
+false => Expr::BinaryExpr(BinaryExpr {
+left: Box::new(Expr::IsNotNull(left)),
+op: Or,
+right: Box::new(lit_bool_null()),
+}),
+})
+}

Review Comment:
   🔐 Suggestion: Consider Using left Instead of right for nullable Check
   
   While left == right ensures equality, semantically it might be clearer to 
consistently use left for readability and convention. This avoids any reader 
confusion on whether left or right is being evaluated for nullability.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-07 Thread via GitHub


kosiew commented on code in PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#discussion_r2030914487


##
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##
@@ -2152,6 +2169,17 @@ mod tests {
 }
 }
 
+#[test]
+fn test_simplify_eq_not_self() {
+let expr_a = col("c2").eq(col("c2"));
+let expr_b = col("c2_non_null").eq(col("c2_non_null"));
+let expected_a = col("c2").is_not_null().or(lit_bool_null());
+let expected_b = lit(true);

Review Comment:
   ```suggestion
   // `expr_a`: column `c2` is nullable, so `c2 = c2` simplifies to `c2 IS 
NOT NULL OR NULL`
   // This ensures the expression is only true when `c2` is not NULL, 
accounting for SQL's NULL semantics.
   let expected_a = col("c2").is_not_null().or(lit_bool_null());
   
   // `expr_b`: column `c2_non_null` is explicitly non-nullable, so 
`c2_non_null = c2_non_null` is always true
   let expected_b = lit(true);
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-07 Thread via GitHub


berkaysynnada commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2782457130

   > I wrote a simple criterion bench 
(https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs)
 and the result is as follows. It showed performance improvement, especially 
for comparing long strings.
   
   Thank you for sharing the results. It doesn't effect much here but next time 
you can exclude the create_context() part out of bench_function :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-07 Thread via GitHub


ding-young commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2782347240

   I wrote a simple criterion bench 
(https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs)
 and the result is as follows. 
   It showed performance improvement, especially for comparing long strings. 
   ```
   select_f32_eq_f32   time:   [366.19 µs 366.42 µs 366.69 µs]
   change: [-79.571% -79.180% -78.787%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
 2 (2.00%) high mild
 6 (6.00%) high severe
   
   select_str_eq_str   time:   [372.53 µs 373.02 µs 373.61 µs]
   change: [-85.333% -85.117% -84.884%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
 4 (4.00%) high mild
 6 (6.00%) high severe
   
   select_str_eq_str_long  time:   [372.49 µs 372.85 µs 373.25 µs]
   change: [-98.351% -98.316% -98.280%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
 3 (3.00%) high mild
 6 (6.00%) high severe
   ```
   
   Plus, since `x IS NOT NULL OR NULL` is slightly better than `CASE`, I'll 
leave the pr as is :)  
   ```
   select_CASE_str_eq_str_long
   time:   [396.36 µs 396.73 µs 397.15 µs]
   Found 10 outliers among 100 measurements (10.00%)
 2 (2.00%) high mild
 8 (8.00%) high severe
   ```
   Thanks for the review! Trying out the bench was also a great experience for 
me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub


alamb commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429451

   We could also use a CASE 
   ```sql
   CASE x IS NOT NULL THEN true ELSE null END
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub


alamb commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429711

   I expect this to make a large performance difference when x is a string type 
(as string comparisons are fairly expensive)
   
   Thank you for this PR @ding-young and the great reviews @2010YOUY01 and 
@berkaysynnada 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub


berkaysynnada commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781400999

   > About the performance, I'm not 100% sure whether this rule worth the 
change, but I made this change because `IS NOT NULL OR NULL` was a bit better 
than actual comparision. 
   
   I recommend writing a simple benchmark by criterion or other tools to get 
more reliable results instead of measuring the CLI execution
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub


ding-young commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781281225

   @2010YOUY01 Instead of applying transformation on filter expression, I 
adjusted the rule to transform x=x into `x IS NOT NULL OR NULL`. This preserves 
the behavior where NULL = NULL evaluates to NULL, which is treated as FALSE in 
filter expressions.
   
   About the performance, I'm not 100% sure whether this rule worth the change, 
but I made this change because `IS NOT NULL OR NULL` was a bit better than 
actual comparision. 
   
![image](https://github.com/user-attachments/assets/75227289-e086-404e-ba7e-d3b8c85c1f53)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-05 Thread via GitHub


ding-young commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2780386375

   > This optimizer transformation doesn't look right for projected columns like
   > 
   > ```
   > create table t1(v1 int);
   > insert into t1 values (null);
   > 
   > select v1=v1 from t1;
   > -- Should return `NULL`, but get `False`
   > ```
   > 
   > However it makes sense if it is only applied on filter expressions.
   
   Thank you for review. I'll make the change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-05 Thread via GitHub


2010YOUY01 commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2780385095

   This optimizer transformation doesn't look right for projected columns like
   ```
   create table t1(v1 int);
   insert into t1 values (null);
   
   select v1=v1 from t1;
   -- Should return `NULL`, but get `False`
   ```
   
   However it makes sense if it is only applied on filter expressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]