[ https://issues.apache.org/jira/browse/CALCITE-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824900#comment-17824900 ]
EveyWu edited comment on CALCITE-6278 at 3/9/24 3:54 AM:
----------------------------------------------------------

[~julianhyde] Thanks for the review.

1. The statement "Since Spark 2.0, string literals (including regex patterns) are unescaped in SQL parser" comes from the Spark [official documentation|https://spark.apache.org/docs/latest/api/sql/index.html#regexp].

!image-2024-03-09-11-13-49-064.png|width=491,height=176!

2. In Spark, unescaping is indeed performed in the parser phase. See `AstBuilder` for the details:
[https://github.com/apache/spark/blob/76b1c122cb7d77e8f175b25b935b9296a669d5d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L2876C1-L2882C4]
The default value of `spark.sql.parser.escapedStringLiterals` is false.

!image-2024-03-09-11-38-08-797.png|width=455,height=85!

3. In Hive, unescaping is not done in the SQL AST parser phase but in the node normalization phase (`Dispatcher#dispatch`). `StrExprProcessor` is the processor that unescapes strings:
[https://github.com/apache/hive/blob/03a76ac70370fb94a78b00496ec511e671c652f2/ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java#L403C1-L405C17]

!image-2024-03-09-11-37-27-816.png|width=520,height=132!

4. Regarding "If unescaping is happening in Spark’s parser, Calcite should also do it in the parser": I think this is unnecessary. First, as Spark and Hive show, different engines handle unescaping differently, and it does not have to happen in the same phase. Second, parser-level unescaping is global and affects every string literal, not only the `rlike` function. Finally, in this change Calcite handles unescaping inside the `rlike` function, which is by far the simplest modification with the least impact. If Calcite also needs global string unescaping, that can be discussed separately in a follow-up Jira issue.
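To make point 4 concrete, here is a minimal Java sketch of what "unescape inside the function" could look like. It is only an illustration under stated assumptions: the class `UnescapeSketch` and the helpers `unescape` and `rlike` are hypothetical names, not Calcite's or Spark's actual code, and the unescaping rules are deliberately simplified (only `\n`, `\t`, `\r`, and "backslash followed by any other character yields that character"). Partial matching via `find()` is also an assumption of the sketch.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Calcite or Spark implementation.
class UnescapeSketch {

  // Simplified unescaping of a raw SQL string literal: a backslash consumes
  // the next character. Only \n, \t and \r are mapped to control characters;
  // any other escaped character is kept as-is (so \\ becomes \).
  static String unescape(String raw) {
    StringBuilder sb = new StringBuilder(raw.length());
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      if (c == '\\' && i + 1 < raw.length()) {
        char next = raw.charAt(++i);
        switch (next) {
          case 'n': sb.append('\n'); break;
          case 't': sb.append('\t'); break;
          case 'r': sb.append('\r'); break;
          default: sb.append(next);
        }
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // Function-local handling: unescape the pattern right before compiling it,
  // leaving every other string literal in the query untouched.
  static boolean rlike(String value, String rawPattern) {
    return Pattern.compile(unescape(rawPattern)).matcher(value).find();
  }

  public static void main(String[] args) {
    // Java source "\\\\d+" is the four-character SQL text \\d+
    String rawPattern = "\\\\d+";
    // With unescaping, the regex is \d+ and matches the digits.
    System.out.println(rlike("abc123", rawPattern));
    // Without unescaping, the regex requires a literal backslash before d.
    System.out.println(Pattern.compile(rawPattern).matcher("abc123").find());
  }
}
```

The point of the sketch is the scope of the change: the unescape step lives inside the one function that needs it, whereas a parser-level change would alter how every string literal is read.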
> Add REGEXP, REGEXP_LIKE function (enabled in Spark library)
> ------------------------------------------------------------
>
>                 Key: CALCITE-6278
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6278
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: EveyWu
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: image-2024-03-07-09-32-27-002.png,
>                      image-2024-03-09-11-13-49-064.png,
>                      image-2024-03-09-11-37-27-816.png,
>                      image-2024-03-09-11-38-08-797.png
>
>
> Add Spark functions that have been implemented but have different OperandTypes/Returns.
> Add Function [REGEXP|https://spark.apache.org/docs/latest/api/sql/index.html#regexp], [REGEXP_LIKE|https://spark.apache.org/docs/latest/api/sql/index.html#regexp_like]
> # Since this function has the same implementation as the Spark [RLIKE|https://spark.apache.org/docs/latest/api/sql/index.html#rlike] function, the implementation can be directly reused.
> # Since Spark 2.0, string literals (including regex patterns) are unescaped in SQL parser, also fix this bug in calcite.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)