[ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864107#comment-16864107
 ] 

Liang-Chi Hsieh commented on SPARK-28043:
-----------------------------------------

Thinking about how to make duplicate JSON keys work, I looked at our current 
implementation. One concern is: how do we know which key maps to which 
Spark SQL field?

Suppose we have two duplicate keys "a" as above, and we infer the Spark SQL 
schema as "a string, a string". Does the order of keys in the JSON string imply 
the order of fields? No such mapping exists in our current implementation, 
which means the order of keys can differ from one JSON string to the next.

Isn't that prone to silent errors when reading JSON?

Another option is to forbid duplicate JSON keys. If we don't want to break 
existing code, maybe we could add a legacy config to fall back to the current 
behavior?
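Both concerns can be seen outside Spark. This is only an illustration, not Spark's actual parser (Spark uses Jackson): Python's stdlib json shows how Map-style parsing silently keeps the last duplicate value, and how a parser hook could instead reject duplicates, which is the behavior the option above would forbid by default.

```python
import json

# Map-style parsing keeps only the LAST value for a duplicate key and
# silently drops the earlier one -- the overriding behavior hypothesized
# for Spark in the issue description.
parsed = json.loads('{"a": "blah", "a": "blah2"}')
print(parsed)  # {'a': 'blah2'} -- "blah" is gone with no error

# Sketch of the "forbid duplicates" option: a hook that sees every raw
# key/value pair can detect duplicates before they collapse into a dict.
def reject_duplicates(pairs):
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate JSON keys: %s" % keys)
    return dict(pairs)

try:
    json.loads('{"a": "blah", "a": "blah2"}',
               object_pairs_hook=reject_duplicates)
except ValueError as e:
    print(e)  # duplicate JSON keys: ['a', 'a']
```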





> Reading json with duplicate columns drops the first column value
> ----------------------------------------------------------------
>
>                 Key: SPARK-28043
>                 URL: https://issues.apache.org/jira/browse/SPARK-28043
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Scala, 2.4):
> {code}
> scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", 
> \"a\": \"blah2\"} ]"))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
> parallelize at <console>:23
> scala> val df = spark.read.json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [a: string, a: string]
> scala> df.show
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
> {code}
>  
> The expected response would be:
> {code}
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
