[ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Murthy updated SPARK-28043:
---------------------------------
    Description: 
When reading a JSON blob with duplicate fields, Spark appears to discard the 
value of the first one. The JSON spec recommends unique names but does not 
require them; since both JSON and Spark SQL allow duplicate field names, we 
should fix the bug that drops the first column's value.

 

I'm guessing that somewhere while parsing the JSON we turn the record into a 
Map, which causes the first value to be overwritten by the second.
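Not part of the repro, but the hypothesized mechanism is easy to see in plain Python: the stdlib json module (like most map-backed parsers) folds duplicate keys into a dict, keeping only the last value.

```python
import json

# Duplicate keys collapse into one dict entry; only the last value survives.
# This is the same "last one wins" behavior the bug report suspects is
# happening inside Spark's JSON parser.
parsed = json.loads('{"a": "blah", "a": "blah2"}')
print(parsed)  # {'a': 'blah2'} -- the first value is silently dropped
```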

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

 

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+
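As an aside (not in the original report): outside Spark, both values can be recovered with the stdlib json module's object_pairs_hook, which receives every key/value pair before any dict folding happens. This is just an illustration of what a duplicate-preserving parse looks like, not a suggested fix.

```python
import json

# object_pairs_hook is called with the raw list of (key, value) pairs,
# so duplicate keys are still visible at this point.
pairs = json.loads('{"a": "blah", "a": "blah2"}',
                   object_pairs_hook=lambda kv: kv)
print(pairs)  # [('a', 'blah'), ('a', 'blah2')] -- both values preserved
```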


> Reading json with duplicate columns drops the first column value
> ----------------------------------------------------------------
>
>                 Key: SPARK-28043
>                 URL: https://issues.apache.org/jira/browse/SPARK-28043
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
> >>> df = spark.read.json(jsonRDD)
> >>> df.show()
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>  
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
