[ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863680#comment-16863680 ]
Liang-Chi Hsieh commented on SPARK-28043:
-----------------------------------------

I looked into this, for example https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object. JSON does not disallow duplicate keys. Spark SQL does not disallow duplicate field names either, although they can impose some difficulties when using a DataFrame with duplicate field names.

To clarify: just because Spark SQL allows duplicate field names does not mean that we should use that feature. But I think that, to some extent, the current behavior is inconsistent.

{code}
scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]"))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> val df = spark.read.json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [a: string, a: string]

scala> df.show
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
{code}

> Reading json with duplicate columns drops the first column value
> ----------------------------------------------------------------
>
>                 Key: SPARK-28043
>                 URL: https://issues.apache.org/jira/browse/SPARK-28043
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the
> value of the first one. JSON recommends unique names but does not require it;
> since JSON and Spark SQL both allow duplicate field names, we should fix the
> bug where the first column value is getting dropped.
>
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which
> is causing the first value to be overridden.
>
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(["{ \"a\": \"blah\", \"a\": \"blah2\"}"])
> >>> df = spark.read.json(jsonRDD)
> >>> df.show()
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+
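
For illustration only: the description guesses that a Map-like structure overwrites the first value somewhere during parsing. The sketch below shows that general effect with Python's standard `json` module, not with Spark's JSON parser, so it says nothing about where the problem actually lives in Spark. The variable names `payload`, `as_dict`, and `as_pairs` are made up for this example.

```python
import json

# Same object as in the repro above: the key "a" appears twice.
payload = '{ "a": "blah", "a": "blah2"}'

# Default decoding builds a dict, so the later value replaces the earlier one.
as_dict = json.loads(payload)

# object_pairs_hook receives the (key, value) pairs in document order,
# so both values are still available to the caller.
as_pairs = json.loads(payload, object_pairs_hook=list)

print(as_dict)   # {'a': 'blah2'}
print(as_pairs)  # [('a', 'blah'), ('a', 'blah2')]
```

Note that a plain map-style overwrite yields a single surviving value, whereas the Spark output above keeps both columns and fills the first with null, so the analogy is loose at best.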