nnguyen168 commented on PR #55466:
URL: https://github.com/apache/spark/pull/55466#issuecomment-4528175469
Hi @yadavay-amzn , I did some deeper analysis and found an edge case with
nested block comments.
pure nested trailing comments are incorrectly kept when they should be
dropped:
| Input | Expected | Actual
|
|-----------------------------------|--------------|--------------------------------------------|
| SELECT 1; /* comment */ | ["SELECT 1"] | ["SELECT 1"] ✓
|
| SELECT 1; /* outer /* inner */ */ | ["SELECT 1"] | ["SELECT 1", " /*
outer /* inner */ */"] ✗ |
Root Cause
At line 697-698, hasPrecedingNonCommentString is recalculated for every
/*, including nested ones:
} else if (hasNext && line.charAt(index + 1) == '*') {
bracketedCommentLevel += 1
hasPrecedingNonCommentString = beginIndex != index &&
line.substring(beginIndex, index).exists(!_.isWhitespace)
}
When the second /* is encountered in /* outer /* inner */ */, the
substring includes the first /* characters, which are non-whitespace, so
hasPrecedingNonCommentString incorrectly becomes true.
Suggested Fix
Only set hasPrecedingNonCommentString when entering the outermost comment:
```
} else if (hasNext && line.charAt(index + 1) == '*') {
// Only set hasPrecedingNonCommentString when entering the outermost
comment
if (bracketedCommentLevel == 0) {
hasPrecedingNonCommentString = beginIndex != index &&
line.substring(beginIndex, index).exists(!_.isWhitespace)
}
bracketedCommentLevel += 1
}
```
Suggested Test Cases
```
// Nested comment WITH preceding SQL (already works)
"SELECT 1; SELECT 2 /* outer /* inner */ */" -> Seq("SELECT 1", " SELECT 2
/* outer /* inner */ */"),
// Pure nested trailing comment (needs fix to work)
"SELECT 1; /* outer /* inner */ */" -> Seq("SELECT 1"),
// Leading nested comment with statement
"/* outer /* inner */ */ SELECT 1;" -> Seq("/* outer /* inner */ */ SELECT
1"),
// Inline nested comment
"SELECT /* nested /* comment */ */ 1;" -> Seq("SELECT /* nested /* comment
*/ */ 1")
```
Happy to help with a follow-up PR if needed!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]