Yuxiang Wei created SPARK-48689:
-----------------------------------

             Summary: Reading lengthy JSON results in a corrupted record.
                 Key: SPARK-48689
                 URL: https://issues.apache.org/jira/browse/SPARK-48689
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.1
         Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
            Reporter: Yuxiang Wei
When reading a DataFrame from a JSON file that contains a very long string, Spark incorrectly treats the row as a corrupted record even though the JSON is well-formed. Here is a minimal reproduction with PySpark:

{code:python}
import json
import tempfile

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content: a single field with a 100-million-character string
data = {
    "text": "a" * 100000000
}

# Create a temporary file and write the JSON content to it
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

# Load the JSON file into a PySpark DataFrame
df = spark.read.json(tmp_file_path)

# Print the DataFrame, which shows the inferred schema
print(df)
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
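As a side check on the claim that the input format is correct (this is an aside, not part of the original report): the same record round-trips cleanly through Python's standard-library {{json}} module, so the file Spark is handed contains valid JSON.

```python
import json

# Rebuild the record from the report: one field holding a very long string
data = {"text": "a" * 100000000}

# Round-trip through the standard-library JSON parser
encoded = json.dumps(data)
decoded = json.loads(encoded)

# The record survives the round trip intact, so the JSON itself is well-formed
print(len(decoded["text"]))
```

Since the standard-library parser accepts the record, the corruption appears to come from how Spark reads long strings, not from the file itself.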