kbendick commented on a change in pull request #4396:
URL: https://github.com/apache/iceberg/pull/4396#discussion_r835697683
##########
File path:
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
##########
@@ -176,7 +176,12 @@ class IcebergSparkSqlExtensionsParser(delegate:
ParserInterface) extends ParserI
}
private def isIcebergCommand(sqlText: String): Boolean = {
- val normalized =
sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+ val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
+ // Catch all SQL comments, including simple comments (that start with
--), as well as multiline comments
+ .replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
+ // Replace any remaining newlines
+ .replaceAll("\\s+", " ")
+ .trim()
Review comment:
Ok. I updated it to use a named function, as mentioned, but the
constraints on that function are pretty large as we still want the antlr parser
to be responsible for removing SQL comments given it can differentiate between
hints and plain comments.
I think it's better the original way and that the best solution would be to
add additional comments clarifying why we're manipulating the `sqlText` within
`isIcebergCommand`.
I do agree that it is somewhat weird to be normalizing the SQL text directly
within that function, but the normalized text should _only_ be used in that
function. So doing so there helps prevent people from accidentally calling
`normalizeSqlText` on sql text that will be passed to the parser.
So my suggestion would be something more like:
```scala
private def isIcebergCommand(sqlText: String): Boolean = {
// The sqlText has not been passed through the parser yet and therefore
has not had
// unimportant comments and unnecessary whitespace and newlines removed.
//
// SQL comments and newlines are removed here so that Iceberg commands
can be properly recognized,
// while still allowing the antlr parser to perform the actual work of
differentiating
// comments that don't have any meaning to the query from comments that
behave as sql hints etc.
val normalized = sqlText.toLowerCase(Locale.ROOT).trim()
// Catch all SQL comments, including simple comments (that start with
--), as well as comments
// of the form /* ... */, including those that span multiple lines.
.replaceAll("(?ms)/\\*.*?\\*/|--.*?\\n", " ")
// Replace any remaining newlines
.replaceAll("\\s+", " ")
.trim()
normalized.startsWith("call") || (
normalized.startsWith("alter table") && (
normalized.contains("add partition field") ||
normalized.contains("drop partition field") ||
normalized.contains("replace partition field") ||
normalized.contains("write ordered by") ||
normalized.contains("write locally ordered by") ||
normalized.contains("write distributed by") ||
normalized.contains("write unordered") ||
normalized.contains("set identifier fields") ||
normalized.contains("drop identifier fields")))
}
```
I'm not in love with this either, but I think what's making things a bit
weird is the fact that we have to check whether or not something is an Iceberg
specific command before using the actual parser. And I'm not prepared to
refactor our parser that much. If that's a high priority, it should be done as
a follow up.
That said, if we want to keep the named function and rely on people using
that function correctly in the future, I won't argue against it any further.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]