Dilip Biswal created SPARK-42118:
------------------------------------

             Summary: Wrong result when parsing a multiline JSON file with differing types for same column
                 Key: SPARK-42118
                 URL: https://issues.apache.org/jira/browse/SPARK-42118
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Dilip Biswal

Here is a simple reproduction of the problem. We have a JSON file in multiLine format whose content looks like the following:

    [{"name":""},{"name":123.34}]

Here is the result of the Spark query when we read the above content:

    scala> val df = spark.read.format("json").option("multiLine", true).load("/tmp/json")
    df: org.apache.spark.sql.DataFrame = [name: double]

    scala> df.show(false)
    +----+
    |name|
    +----+
    |null|
    +----+

    scala> df.count
    res5: Long = 2

This is quite a serious problem for us, as it is causing us to master corrupt data in the lake. If there is an issue with parsing the input, we expect Spark to set "_corrupt_record" so that we can act on it. Please note that df.count reports 2 rows, whereas df.show displays only 1 row, with a null value.
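
For anyone who wants to reproduce this from scratch, the following is a minimal sketch that writes the input file before reading it. It assumes a spark-shell session in which `spark` is the predefined SparkSession. The path is a hypothetical one chosen for this sketch (the session above loads "/tmp/json"). No expected output is claimed here beyond what is reported above.

```scala
// Paste into spark-shell, where `spark` is the predefined SparkSession.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical path for this sketch; the report itself loads "/tmp/json".
val path = "/tmp/spark-42118-repro.json"

// Same content as in the report: one JSON array holding two objects whose
// "name" values have different types (empty string vs. number).
val content = """[{"name":""},{"name":123.34}]"""
Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))

// Same read options as in the report.
val df = spark.read.format("json").option("multiLine", true).load(path)

df.printSchema()      // inferred schema
df.show(false)        // rows actually materialized
println(df.count())   // row count, to compare against the rows shown
```

Comparing the output of df.show with df.count side by side should make the mismatch described above easy to check on other Spark versions.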