cloud-fan commented on code in PR #55661:
URL: https://github.com/apache/spark/pull/55661#discussion_r3235920363
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -557,6 +593,30 @@ private void parseFloatingPoint(JsonParser parser) throws IOException {
}
}
+ // Reject JSON strings that contain unpaired UTF-16 surrogate code units. Java strings can
+ // hold lone surrogates, but RFC 8259 section 7 requires JSON string contents to be well-formed
+ // Unicode. Stricter parsers such as simdjson reject these inputs, while Jackson's
+ // `ReaderBasedJsonParser` accepts them and silently drops the invalid character to U+FFFD
+ // when the result is encoded as UTF-8. That silent replacement causes data corruption, so
Review Comment:
The earlier wording suggestion doesn't appear to have landed: line 599
still reads "silently drops the invalid character to U+FFFD", which mixes
idioms ("drops…to"). The Javadoc on the new `parseJson` overload (line 69)
already uses "silently replaced … with", so I suggest matching that here:
```suggestion
// `ReaderBasedJsonParser` accepts them and silently replaces the invalid character with
// U+FFFD when the result is encoded as UTF-8. That silent replacement causes data
```
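For readers following the thread, here is a minimal sketch of what "unpaired UTF-16 surrogate" means in terms of Java `char` values. It is not the check added in this PR: the class name `SurrogateCheck` and the method `hasUnpairedSurrogate` are hypothetical and only illustrate the condition the code comment describes, using the standard `Character.isHighSurrogate` and `Character.isLowSurrogate` helpers.

```java
public final class SurrogateCheck {
  private SurrogateCheck() {}

  /**
   * Returns true if the string contains a high surrogate that is not
   * immediately followed by a low surrogate, or a low surrogate that is
   * not preceded by a high surrogate. (Hypothetical helper for illustration.)
   */
  public static boolean hasUnpairedSurrogate(String s) {
    int i = 0;
    while (i < s.length()) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        // A high surrogate must be followed by a low surrogate.
        if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
          return true;
        }
        i += 2; // Skip the well-formed pair.
      } else if (Character.isLowSurrogate(c)) {
        // A low surrogate reached here has no preceding high surrogate.
        return true;
      } else {
        i++;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasUnpairedSurrogate("\uD835"));       // lone high surrogate: true
    System.out.println(hasUnpairedSurrogate("\uD835\uDFD8")); // well-formed pair: false
  }
}
```

The lone `\uD835` is the same code unit that the new test in `VariantEndToEndSuite` uses as its invalid input.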
##########
sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala:
##########
@@ -185,6 +185,42 @@ class VariantEndToEndSuite extends SharedSparkSession {
checkAnswer(variantDF, Seq(Row(expected)))
}
+ test("SPARK-56654: parse_json/from_json reject unpaired UTF-16 surrogates by default") {
+ val invalidJson = "\"\\uD835\""
+ val df = Seq(invalidJson).toDF("j")
+ checkAnswer(df.selectExpr("try_parse_json(j)"), Seq(Row(null)))
+ checkAnswer(df.selectExpr("from_json(j, 'variant')"), Seq(Row(null)))
+ val parseJsonError = intercept[SparkException] {
+ df.selectExpr("parse_json(j)").collect()
+ }
+ checkError(exception = parseJsonError,
+ condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+ parameters = Map(
+ "badRecord" -> invalidJson,
+ "failFastMode" -> "FAILFAST")
+ )
+
+ val fromJsonFailFast = intercept[SparkException] {
+ df.selectExpr("from_json(j, 'variant', map('mode', 'FAILFAST'))").collect()}
+ checkError(
+ exception = fromJsonFailFast,
+ condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+ parameters = Map(
+ "badRecord" -> "[null]",
+ "failFastMode" -> "FAILFAST"
+ )
+ )
Review Comment:
Two indentation issues in the same block: (a) the first `checkError(...)` is
flat-indented while the second one below it uses the standard 2-space indent,
which is inconsistent within the same test; (b) at line 204, the `}` closing the
`intercept` lambda is glued to `.collect()` and the trailing `checkError(` is
over-indented. Suggested cleanup matching the rest of the suite:
```suggestion
    checkError(
      exception = parseJsonError,
      condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
      parameters = Map("badRecord" -> invalidJson, "failFastMode" -> "FAILFAST")
    )
    val fromJsonFailFast = intercept[SparkException] {
      df.selectExpr("from_json(j, 'variant', map('mode', 'FAILFAST'))").collect()
    }
    checkError(
      exception = fromJsonFailFast,
      condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
      parameters = Map("badRecord" -> "[null]", "failFastMode" -> "FAILFAST")
    )
```