[GitHub] [iceberg] kbendick commented on a change in pull request #4396: Spark - Ignore SQL comments when testing if SQL command is an Iceberg specific command

GitBox Fri, 25 Mar 2022 19:51:15 -0700


kbendick commented on a change in pull request #4396:
URL: https://github.com/apache/iceberg/pull/4396#discussion_r835697683




##########
File path: 
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
##########
@@ -176,7 +176,12 @@ class IcebergSparkSqlExtensionsParser(delegate: 
ParserInterface) extends ParserI
   }
 
   private def isIcebergCommand(sqlText: String): Boolean = {
-    val normalized = 
sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+    val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
+      // Catch all SQL comments, including simple comments (that start with 
--), as well as multiline comments
+      .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
+      // Replace any remaining newlines
+      .replaceAll("\\s+", " ")
+      .trim()

Review comment:
       Ok. I updated it to use a named function, as mentioned, but the 
constraints on that function are pretty large as we still want the antlr parser 
to be responsible for removing SQL comments given it can differentiate between 
hints and plain comments.
   
   I think it's better the original way and that the best solution would be to 
add additional comments clarifying why we're manipulating the `sqlText` within 
`isIcebergCommand`.
   
   I do agree that it is somewhat weird to be normalizing the SQL text directly 
within that function, but the normalized text should _only_ be used in that 
function. So doing so there helps prevent people from accidentally calling 
`normalizeSqlText` on sql text that will be passed to the parser.
   
   So my suggestion would be something more like:
   ```scala
     private def isIcebergCommand(sqlText: String): Boolean = {
       // The sqlText has not been passed through the parser yet and therefore 
has not had
       // unimportant comments and unnecessary whitespace and newlines removed.
       // 
       // SQL comments and newlines are removed here so that Iceberg commands 
can be properly recognized,
       // while still allowing the antlr parser to perform the actual work of 
differentiating
       // comments that don't have any meaning to the query from comments that 
behave as sql hints etc.
       val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
         // Catch all SQL comments, including simple comments (that start with 
--), as well as comments
         // of the form /* ... */, including those that span multiple lines.
         .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
         // Replace any remaining newlines
         .replaceAll("\\s+", " ")
         .trim()
       normalized.startsWith("call") || (
           normalized.startsWith("alter table") && (
               normalized.contains("add partition field") ||
               normalized.contains("drop partition field") ||
               normalized.contains("replace partition field") ||
               normalized.contains("write ordered by") ||
               normalized.contains("write locally ordered by") ||
               normalized.contains("write distributed by") ||
               normalized.contains("write unordered") ||
               normalized.contains("set identifier fields") ||
               normalized.contains("drop identifier fields")))
     }
   ```
   
   I'm not in love with this either, but I think what's making things a bit 
weird is the fact that we have to check whether or not something is an Iceberg 
specific command before using the actual parser. And I'm not prepared to 
refactor our parser that much. If that's a high priority, it should be done as 
a follow up.
   
   That said, if we want to keep the named function and rely on people using 
that function correctly in the future, I won't argue against it any further.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] kbendick commented on a change in pull request #4396: Spark - Ignore SQL comments when testing if SQL command is an Iceberg specific command

Reply via email to