[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuxiang Wei updated SPARK-48689:
--------------------------------
    Description:
When reading a DataFrame from a JSON file that contains a very long string, Spark incorrectly treats the row as a corrupted record even though the JSON is well-formed. Here is a minimal example with PySpark:

    import json
    import tempfile
    from pyspark.sql import SparkSession

    # Create a Spark session
    spark = (SparkSession.builder
        .appName("PySpark JSON Example")
        .getOrCreate()
    )

    # Define the JSON content
    data = {
        "text": "a" * 100000000
    }

    # Create a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
        # Write the JSON content to the temporary file
        tmp_file.write(json.dumps(data) + "\n")
        tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    # (after the with block, so the file is flushed and closed)
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)

  was:
When reading a DataFrame from a JSON file that contains a very long string, Spark incorrectly treats the row as a corrupted record even though the JSON is well-formed. Here is a minimal example with PySpark:

    import json
    import tempfile
    from pyspark.sql import SparkSession

    # Create a Spark session
    spark = SparkSession.builder \
        .appName("PySpark JSON Example") \
        .getOrCreate()

    # Define the JSON content
    data = {
        "text": "a" * 100000000
    }

    # Create a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
        # Write the JSON content to the temporary file
        tmp_file.write(json.dumps(data) + "\n")
        tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    # (after the with block, so the file is flushed and closed)
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
>                 Key: SPARK-48689
>                 URL: https://issues.apache.org/jira/browse/SPARK-48689
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.1
>         Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>            Reporter: Yuxiang Wei
>            Priority: Major
>              Labels: Reader
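
The report states that the file is well-formed even though Spark flags it as corrupted. One way to check that claim independently of Spark is to parse the same file with Python's standard json module. The following is a minimal sketch under that assumption; the helper names write_long_json and check_json_lines are hypothetical and do not appear in the original report.

```python
import json
import os
import tempfile


def write_long_json(n_chars: int) -> str:
    """Write one JSON object whose "text" value is n_chars long; return the file path."""
    data = {"text": "a" * n_chars}
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
        # One JSON document per line, as in the report
        tmp_file.write(json.dumps(data) + "\n")
        return tmp_file.name  # the file is closed when the with block exits


def check_json_lines(path: str) -> list[int]:
    """Parse every non-empty line as JSON and return the length of each "text" value."""
    lengths = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)  # raises json.JSONDecodeError if malformed
                lengths.append(len(record["text"]))
    return lengths
```

Calling check_json_lines(write_long_json(100000000)) uses the same string length as the report. If it returns without raising json.JSONDecodeError, the file itself is valid JSON, which would point at the reader rather than the data. Remove the temporary file afterwards with os.remove.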