kbendick commented on a change in pull request #4396:
URL: https://github.com/apache/iceberg/pull/4396#discussion_r835697683
##########
File path:
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
##########
@@ -176,7 +176,12 @@ class IcebergSparkSqlExtensionsParser(delegate:
ParserInterface) extends ParserI
}
private def isIcebergCommand(sqlText: String): Boolean = {
- val normalized =
sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+ val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
+ // Catch all SQL comments, including simple comments (that start with
--), as well as multiline comments
+ .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
+ // Replace any remaining newlines
+ .replaceAll("\\s+", " ")
+ .trim()
Review comment:
Ok. I updated it to use a named function, but the constraints on that
function are pretty large as we want the antlr parser to be responsible for
removing SQL comments etc.
I think it's better the original way and that the best solution would be to
add additional comments clarifying why we're manipulating the `sqlText` within
`isIcebergCommand`. I do concede that overall it is somewhat weird to be
normalizing the SQL text directly within that function, but doing so helps
prevent people from accidentally calling `normalizeSqlText` on sql text that
will be passed to the parser.
So my suggestion would be something more like:
```scala
private def isIcebergCommand(sqlText: String): Boolean = {
// The sqlText has not been passed through the parser yet and therefore
has not had
// unimportant comments and unnecessary whitespace and newlines removed.
//
// SQL comments and newlines are removed here so that Iceberg commands
can be properly recognized,
// while still allowing the antlr parser to perform the actual work of
differentiating
// comments that don't have any meaning to the query from comments that
behave as sql hints etc.
val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
// Catch all SQL comments, including simple comments (that start with
--), as well as comments
// of the form /* ... */, including those that span multiple lines.
.replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
// Replace any remaining newlines
.replaceAll("\\s+", " ")
.trim()
normalized.startsWith("call") || (
normalized.startsWith("alter table") && (
normalized.contains("add partition field") ||
normalized.contains("drop partition field") ||
normalized.contains("replace partition field") ||
normalized.contains("write ordered by") ||
normalized.contains("write locally ordered by") ||
normalized.contains("write distributed by") ||
normalized.contains("write unordered") ||
normalized.contains("set identifier fields") ||
normalized.contains("drop identifier fields")))
}
```
I'm not in love with this either, but I think what's making things a bit
weird is the fact that we have to check whether or not something is an Iceberg
specific command before using the actual parser.
I'm not prepared to refactor the parser to that level yet, so I think his is
the best solution. That said, if we want to use the current `normalizeSqlText`
function with the comment / understanding that it should only be used in that
one spot, that is a decision I would leave up to the community. I personally
think it's unsafe in the long term to have a function, `normalizeSqlText`, that
is only meant to be used in one very specific location and which could have
dangerous side effects if used elsewhere.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]