Yuan Yuan created SPARK-54102:
---------------------------------
Summary: Spark 4.0.1 still throws "String length (20054016)
exceeds the maximum length (20000000)" and "from_json" fails on a very large
JSON with a jackson_core parse error
Key: SPARK-54102
URL: https://issues.apache.org/jira/browse/SPARK-54102
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.1
Environment: pyspark 4.0.1
Reporter: Yuan Yuan
According to JIRA *SPARK-49872* and the implementation in
[{{JsonProtocol.scala}}|https://github.com/apache/spark/blob/29434ea766b0fc3c3bf6eaadb43a8f931133649e/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L71],
the relevant limit was removed in {*}4.0.1{*}. However, in our environment we can
still reliably reproduce the following:
# When generating/processing a very large string (see the second sketch at the end of this description):
{code:java}
Caused by: com.fasterxml.jackson.core.exc.StreamConstraintsException: String
value length (20040525) exceeds the maximum allowed (20000000, from
`StreamReadConstraints.getMaxStringLength()`){code}
# When using {{from_json}} on a _valid_ very large single-line JSON (no
missing comma), Jackson throws at around column {*}20,271,838{*}:
{code:java}
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character
('1' (code 49)): was expecting comma to separate Object entries
at [Source: UNKNOWN; line: 1, column: 20271838]{code}
I'm sure this is not a formatting issue. If I truncate the JSON to below column
{*}20,271,838{*}, it parses successfully.
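For reference, here is a simplified sketch of the kind of input that hits this
second error. The column name, schema, and generated data are placeholders rather
than our real pipeline; the only property that matters is that the JSON is valid,
single-line, and longer than roughly 20,000,000 characters:
{code:python}
from pyspark.sql import SparkSession, functions as f, types as t

spark = SparkSession.builder.getOrCreate()

# Build a valid single-line JSON object longer than Jackson's default
# 20,000,000-character StreamReadConstraints limit (~23 million characters here).
entries = ",".join('"k%d": "%s"' % (i, "x" * 100) for i in range(200000))
big_json = "{" + entries + "}"

raw_df = spark.createDataFrame([(big_json,)], ["item"])
map_schema = t.MapType(t.StringType(), t.StringType())

# Expected: a map with 200000 entries. Observed behaviour per the description
# above: a JsonParseException once the input passes the ~20,000,000-character
# mark, while the same JSON truncated below that point parses successfully.
parsed = raw_df.withColumn("parsed_item", f.from_json(f.col("item"), map_schema))
parsed.select(f.size("parsed_item")).show()
{code}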
Here is my parsing code:
{code:python}
from pyspark.sql import functions as f

raw_df.withColumn("parsed_item", f.from_json(f.col("item"), my_schema))
{code}
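For the first error, my understanding of the
{{StreamReadConstraints.getMaxStringLength()}} message is that it is hit when a
single JSON string value is itself longer than 20,000,000 characters. A simplified
sketch of that case (again with a placeholder field name and schema rather than
our real ones):
{code:python}
from pyspark.sql import SparkSession, functions as f, types as t

spark = SparkSession.builder.getOrCreate()

# One JSON document whose single string value exceeds the 20,000,000-character
# limit; 20040525 matches the length reported in the first stack trace.
big_value = "x" * 20040525
doc = '{"payload": "%s"}' % big_value

raw_df = spark.createDataFrame([(doc,)], ["item"])
struct_schema = t.StructType([t.StructField("payload", t.StringType())])

# Per the first stack trace above, parsing a string value of this length raises
# com.fasterxml.jackson.core.exc.StreamConstraintsException.
parsed = raw_df.withColumn("parsed_item", f.from_json(f.col("item"), struct_schema))
parsed.select(f.length("parsed_item.payload")).show()
{code}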