[GitHub] [iceberg] kbendick commented on a change in pull request #4396: Spark - Ignore SQL comments when testing if SQL command is an Iceberg specific command

GitBox Fri, 25 Mar 2022 19:23:31 -0700


kbendick commented on a change in pull request #4396:
URL: https://github.com/apache/iceberg/pull/4396#discussion_r835697683




##########
File path: 
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
##########
@@ -176,7 +176,12 @@ class IcebergSparkSqlExtensionsParser(delegate: 
ParserInterface) extends ParserI
   }
 
   private def isIcebergCommand(sqlText: String): Boolean = {
-    val normalized = 
sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+    val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
+      // Catch all SQL comments, including simple comments (that start with 
--), as well as multiline comments
+      .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
+      // Replace any remaining newlines
+      .replaceAll("\\s+", " ")
+      .trim()

Review comment:
       Ok. I updated it to use a named function, but the constraints on that 
function are pretty large as we want the antlr parser to be responsible for 
removing SQL comments etc.
   
   I think it's better the original way and that the best solution would be to 
add additional comments clarifying why we're manipulating the `sqlText` within 
`isIcebergCommand`. I do concede that overall it is somewhat weird to be 
normalizing the SQL text directly within that function, but doing so helps 
prevent people from accidentally calling `normalizeSqlText` on sql text that 
will be passed to the parser.
   
   So my suggestion would be something more like:
   ```scala
     private def isIcebergCommand(sqlText: String): Boolean = {
       // The sqlText has not been passed through the parser yet and therefore 
has not had
       // unimportant comments and unnecessary whitespace and newlines removed.
       // 
       // SQL comments and newlines are removed here so that Iceberg commands 
can be properly recognized,
       // while still allowing the antlr parser to perform the actual work of 
differentiating
       // comments that don't have any meaning to the query from comments that 
behave as sql hints etc.
       val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
         // Catch all SQL comments, including simple comments (that start with 
--), as well as comments
         // of the form /* ... */, including those that span multiple lines.
         .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
         // Replace any remaining newlines
         .replaceAll("\\s+", " ")
         .trim()
       normalized.startsWith("call") || (
           normalized.startsWith("alter table") && (
               normalized.contains("add partition field") ||
               normalized.contains("drop partition field") ||
               normalized.contains("replace partition field") ||
               normalized.contains("write ordered by") ||
               normalized.contains("write locally ordered by") ||
               normalized.contains("write distributed by") ||
               normalized.contains("write unordered") ||
               normalized.contains("set identifier fields") ||
               normalized.contains("drop identifier fields")))
     }
   ```
   
   I'm not in love with this either, but I think what's making things a bit 
weird is the fact that we have to check whether or not something is an Iceberg 
specific command before using the actual parser.
   
   I'm not prepared to refactor the parser to that level yet, so I think his is 
the best solution. That said, if we want to use the current `normalizeSqlText` 
function with the comment / understanding that it should only be used in that 
one spot, that is a decision I would leave up to the community. I personally 
think it's unsafe in the long term to have a function, `normalizeSqlText`, that 
is only meant to be used in one very specific location and which could have 
dangerous side effects if used elsewhere.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] kbendick commented on a change in pull request #4396: Spark - Ignore SQL comments when testing if SQL command is an Iceberg specific command

Reply via email to