Re: [PR] [SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing [spark]

via GitHub Wed, 13 May 2026 09:34:07 -0700


cloud-fan commented on code in PR #55661:
URL: https://github.com/apache/spark/pull/55661#discussion_r3235920363



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -557,6 +593,30 @@ private void parseFloatingPoint(JsonParser parser) throws IOException {
     }
   }
 
+  // Reject JSON strings that contain unpaired UTF-16 surrogate code units. Java strings can
+  // hold lone surrogates, but RFC 8259 section 7 requires JSON string contents to be well-formed
+  // Unicode. Stricter parsers such as simdjson reject these inputs, while Jackson's
+  // `ReaderBasedJsonParser` accepts them and silently drops the invalid character to U+FFFD
+  // when the result is encoded as UTF-8. That silent replacement causes data corruption, so

Review Comment:
   The earlier wording suggestion doesn't appear to have landed: line 599 
still reads "silently drops the invalid character to U+FFFD", which mixes 
idioms ("drops…to"). The Javadoc on the new `parseJson` overload (line 69) 
already uses "silently replaced … with", so I suggest matching that wording here:
   
   ```suggestion
     // `ReaderBasedJsonParser` accepts them and silently replaces the invalid character with
     // U+FFFD when the result is encoded as UTF-8. That silent replacement causes data
   ```
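
For readers following the thread, here is a minimal sketch of the kind of check the quoted comment describes: scanning a Java string for unpaired UTF-16 surrogate code units. The class `SurrogateCheck` and the helper `hasUnpairedSurrogate` are hypothetical names used only for illustration; this is not the code in the PR, and it relies only on the standard `Character.isHighSurrogate` / `Character.isLowSurrogate` methods.

```java
public class SurrogateCheck {
    // Hypothetical helper (not the PR's implementation): returns true if the
    // string contains a high surrogate that is not immediately followed by a
    // low surrogate, or a low surrogate that is not preceded by a high one.
    public static boolean hasUnpairedSurrogate(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired =
                    i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    return true; // lone high surrogate
                }
                i += 2; // skip the well-formed pair
                continue;
            }
            if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
            i++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Lone high surrogate, the same code unit the new test uses.
        System.out.println(hasUnpairedSurrogate("\uD835"));       // true
        // Well-formed surrogate pair.
        System.out.println(hasUnpairedSurrogate("\uD835\uDC00")); // false
        // Plain ASCII.
        System.out.println(hasUnpairedSurrogate("abc"));          // false
    }
}
```

A parser that wants to reject such input could run a check like this on each decoded string value and raise a parse error when it returns true, instead of letting the invalid code unit reach the UTF-8 encoder.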



##########
sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala:
##########
@@ -185,6 +185,42 @@ class VariantEndToEndSuite extends SharedSparkSession {
     checkAnswer(variantDF, Seq(Row(expected)))
   }
 
+  test("SPARK-56654: parse_json/from_json reject unpaired UTF-16 surrogates by default") {
+    val invalidJson = "\"\\uD835\""
+    val df = Seq(invalidJson).toDF("j")
+    checkAnswer(df.selectExpr("try_parse_json(j)"), Seq(Row(null)))
+    checkAnswer(df.selectExpr("from_json(j, 'variant')"), Seq(Row(null)))
+    val parseJsonError = intercept[SparkException] {
+      df.selectExpr("parse_json(j)").collect()
+    }
+    checkError(exception = parseJsonError,
+    condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+    parameters = Map(
+    "badRecord" -> invalidJson,
+    "failFastMode" -> "FAILFAST")
+    )
+
+    val fromJsonFailFast = intercept[SparkException] {
+      df.selectExpr("from_json(j, 'variant', map('mode', 'FAILFAST'))").collect()}
+      checkError(
+        exception = fromJsonFailFast,
+        condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+        parameters = Map(
+          "badRecord" -> "[null]",
+          "failFastMode" -> "FAILFAST"
+          )
+        )

Review Comment:
   Two indentation issues in the same block: (a) the first `checkError(...)` is 
flat-indented while the second one below it uses the standard 2-space indent — 
inconsistent within the same test; (b) at line 204, the `}` closing the 
`intercept` lambda is glued to `.collect()` and the trailing `checkError(` is 
over-indented. Suggested cleanup matching the rest of the suite:
   
   ```suggestion
       checkError(
         exception = parseJsonError,
         condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
         parameters = Map("badRecord" -> invalidJson, "failFastMode" -> "FAILFAST")
       )
   
       val fromJsonFailFast = intercept[SparkException] {
         df.selectExpr("from_json(j, 'variant', map('mode', 'FAILFAST'))").collect()
       }
       checkError(
         exception = fromJsonFailFast,
         condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
         parameters = Map("badRecord" -> "[null]", "failFastMode" -> "FAILFAST")
       )
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

