[jira] [Comment Edited] (CALCITE-6278) Add REGEXP, REGEXP_LIKE function (enabled in Spark library)

EveyWu (Jira) Fri, 08 Mar 2024 19:55:03 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824900#comment-17824900
 ]


 EveyWu edited comment on CALCITE-6278 at 3/9/24 3:54 AM:
----------------------------------------------------------

[~julianhyde] Thanks for the review.

1. "Since Spark 2.0, string literals (including regex patterns) are unescaped 
in SQL parser": this description comes from the Spark [official 
documentation|https://spark.apache.org/docs/latest/api/sql/index.html#regexp].

!image-2024-03-09-11-13-49-064.png|width=491,height=176!

2. In Spark, unescaping is indeed performed in the parser phase. See 
`AstBuilder` for the details: 
[https://github.com/apache/spark/blob/76b1c122cb7d77e8f175b25b935b9296a669d5d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L2876C1-L2882C4]

The default value of `spark.sql.parser.escapedStringLiterals` is false.

!image-2024-03-09-11-38-08-797.png|width=455,height=85!
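
To illustrate what parser-level unescaping means for a regex pattern, here is a minimal sketch. It is not Spark's code: the class `ParserUnescapeSketch`, the helper `unescape`, and the method `literalValue` are hypothetical names, and the escape handling is deliberately simplified. The boolean parameter only models the effect of `spark.sql.parser.escapedStringLiterals`.

```java
import java.util.regex.Pattern;

final class ParserUnescapeSketch {
    // Simplified, hypothetical unescape: handles \n, \t and \r, and otherwise
    // drops the backslash and keeps the next character (so \\ becomes \).
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    default: out.append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Models the config flag: when it is true the literal text is kept as
    // written; when it is false (the default) the parser unescapes it.
    static String literalValue(String rawText, boolean escapedStringLiterals) {
        return escapedStringLiterals ? rawText : unescape(rawText);
    }

    public static void main(String[] args) {
        // The SQL text between the quotes is \\d+ (two backslashes).
        String rawText = "\\\\d+";
        String unescaped = literalValue(rawText, false); // \d+
        String kept = literalValue(rawText, true);       // \\d+
        System.out.println(unescaped + " matches 123: " + Pattern.matches(unescaped, "123"));
        System.out.println(kept + " matches 123: " + Pattern.matches(kept, "123"));
    }
}
```

With the flag at its default, the regex engine receives `\d+` and matches digits. With the flag enabled, it receives `\\d+`, which means a literal backslash followed by one or more `d` characters.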

 

3. In Hive, unescaping is not done in the SQL AST parser phase but in the node 
normalization phase (`Dispatcher#dispatch`), where `StrExprProcessor` is the 
processor that unescapes strings.

[https://github.com/apache/hive/blob/03a76ac70370fb94a78b00496ec511e671c652f2/ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java#L403C1-L405C17]

!image-2024-03-09-11-37-27-816.png|width=520,height=132!
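
The difference in phase can be sketched as follows. This is an illustration only, not Hive's code: `RawStringNode`, `StringConstant`, and `processString` are hypothetical stand-ins for an AST node that still carries the raw literal text and for a per-node processor that runs in a later pass. The unescape step is reduced to dropping the backslash in front of any character.

```java
final class NormalizationPhaseSketch {
    /** Hypothetical AST node: the parser stores the literal text exactly as written. */
    static final class RawStringNode {
        final String rawText;
        RawStringNode(String rawText) { this.rawText = rawText; }
    }

    /** Hypothetical typed expression produced by the later normalization pass. */
    static final class StringConstant {
        final String value;
        StringConstant(String value) { this.value = value; }
    }

    // Simplified assumption: a backslash followed by any character becomes that character.
    static String unescape(String raw) {
        return raw.replaceAll("\\\\(.)", "$1");
    }

    // Stand-in for a string processor selected by a dispatcher after parsing.
    static StringConstant processString(RawStringNode node) {
        return new StringConstant(unescape(node.rawText));
    }

    public static void main(String[] args) {
        RawStringNode parsed = new RawStringNode("a\\\\.c"); // SQL text: a\\.c
        StringConstant normalized = processString(parsed);
        System.out.println("after parsing:       " + parsed.rawText);
        System.out.println("after normalization: " + normalized.value);
    }
}
```

The value seen by later stages is the same as with parser-level unescaping; only the point at which the raw text is converted differs.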

4. "If unescaping is happening in Spark’s parser, Calcite should also do it in 
the parser": I think this is unnecessary.

First, as Spark and Hive show, different engines handle unescaping 
differently, and it does not have to happen in the same phase. In addition, 
parser-level unescaping is global and affects more than the `rlike` 
function. Finally, this change handles unescaping inside Calcite's `rlike` 
function, which is by far the simplest modification with the least impact.

If Calcite also needs global string unescaping, that can be discussed 
separately in a follow-up Jira issue.
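
As a rough sketch of the function-level approach, the pattern argument can be unescaped inside the function just before the regex is compiled. This is not the code from the pull request: `RlikeSketch`, `unescape`, and `rlike` are hypothetical names, the unescape rule is simplified, and the use of `find` (a match anywhere in the input) is an assumption of this sketch.

```java
import java.util.regex.Pattern;

final class RlikeSketch {
    // Simplified assumption: a backslash followed by any character becomes that character.
    static String unescape(String raw) {
        return raw.replaceAll("\\\\(.)", "$1");
    }

    // Unescaping is local to this function; other string literals are untouched.
    static boolean rlike(String s, String rawPattern) {
        return Pattern.compile(unescape(rawPattern)).matcher(s).find();
    }

    public static void main(String[] args) {
        // The SQL pattern text is \\d+ (two backslashes), as a Spark user would write it.
        System.out.println(rlike("abc123", "\\\\d+"));
        System.out.println(rlike("abc", "\\\\d+"));
    }
}
```

Because only the pattern passed to `rlike` is unescaped, the parser and every other function keep their existing behavior, which is the limited impact described above.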

 

 



> Add REGEXP, REGEXP_LIKE  function (enabled in Spark library)
> ------------------------------------------------------------
>
>                 Key: CALCITE-6278
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6278
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter:  EveyWu
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: image-2024-03-07-09-32-27-002.png, 
> image-2024-03-09-11-13-49-064.png, image-2024-03-09-11-37-27-816.png, 
> image-2024-03-09-11-38-08-797.png
>
>
> Add Spark functions that have been implemented but have different 
> OperandTypes/Returns.
> Add Function 
> [REGEXP|https://spark.apache.org/docs/latest/api/sql/index.html#regexp], 
> [REGEXP_LIKE|https://spark.apache.org/docs/latest/api/sql/index.html#regexp_like]
>  # Since this function has the same implementation as the Spark 
> [RLIKE|https://spark.apache.org/docs/latest/api/sql/index.html#rlike] 
> function, the implementation can be directly reused.
>  # Since Spark 2.0, string literals (including regex patterns) are unescaped 
> in the SQL parser; also fix this bug in Calcite.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
