kosiew commented on code in PR #23188:
URL: https://github.com/apache/datafusion/pull/23188#discussion_r3490600637
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -1099,6 +1099,20 @@ pub fn binary(
Ok(Arc::new(BinaryExpr::new(lhs, op, rhs)))
}
+fn sql_similar_to_regex(pattern: &str) -> String {
+ let mut result = String::with_capacity(pattern.len() + 6);
+ result.push_str("^(?:");
+ for ch in pattern.chars() {
+ match ch {
+ '%' => result.push_str(".*"),
+ '_' => result.push('.'),
+ c => result.push(c),
Review Comment:
Thanks for tackling this. I think there is still one correctness issue here.
`SIMILAR TO` currently copies every non-`%` and non-`_` character directly
into the Arrow regex. That means regex metacharacters like `.`, `^`, and `$`
are still treated as regex operators even though they are literals in SQL
`SIMILAR TO` patterns.
For example, the SQL pattern `a.` should only match the literal string `a.`,
but the current translation produces `^(?:a.)$`, so `SELECT 'ab' SIMILAR TO
'a.'` incorrectly returns true.
Could we translate the SQL pattern grammar explicitly instead? SQL literals
should be escaped for the regex, and only the metacharacters that `SIMILAR TO`
actually defines should be emitted as regex syntax.
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -1099,6 +1099,20 @@ pub fn binary(
Ok(Arc::new(BinaryExpr::new(lhs, op, rhs)))
}
+fn sql_similar_to_regex(pattern: &str) -> String {
+ let mut result = String::with_capacity(pattern.len() + 6);
+ result.push_str("^(?:");
+ for ch in pattern.chars() {
+ match ch {
+ '%' => result.push_str(".*"),
Review Comment:
One other correctness issue I noticed is that `%` and `_` are translated to
`.*` and `.`, but Rust and Arrow regexes do not let `.` match newlines by
default.
SQL wildcards are expected to match any character, including newlines, so
values containing `\n` still behave incorrectly. For example, `'a\nb' SIMILAR
TO 'a%b'` should match, but the generated `^(?:a.*b)$` does not.
Could we use a dot-all form such as `(?s:.*)` and `(?s:.)`, or an equivalent
character class? It would also be great to add a regression test for this case.
##########
datafusion/sqllogictest/test_files/strings.slt:
##########
@@ -91,6 +91,27 @@ P1e1
P1m1e1
e1
+# SIMILAR TO with `%` wildcard (zero or more characters)
Review Comment:
The new SLT coverage covers the common `%`, `_`, and anchoring cases well.
After updating the translator, could we also add a small regression test
showing that regex metacharacters are treated as literals? For example,
`SIMILAR TO 'a.'` should match `'a.'` but not `'ab'`. That should help prevent
accidental regressions back to raw regex semantics.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]