[ https://issues.apache.org/jira/browse/SPARK-32431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Gekk updated SPARK-32431:
-------------------------------

Description:

The code below throws org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; for multiple file formats, because the requested schema contains column names that collide under case-insensitive resolution.

{code:scala}
import org.apache.spark.sql.types._

spark.conf.set("spark.sql.caseSensitive", "false")

// Note: "avro" requires the external spark-avro package on the classpath.
val formats = Seq("parquet", "orc", "avro", "json")

// Three column names that differ only by case.
val caseInsensitiveSchema = new StructType()
  .add("LowerCase", LongType)
  .add("camelcase", LongType)
  .add("CamelCase", LongType)

formats.map { format =>
  val path = s"/tmp/$format"
  spark
    .range(1L)
    .selectExpr("id AS lowercase", "id + 1 AS camelCase")
    .write.mode("overwrite").format(format).save(path)
  spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}

Similar code with a nested schema behaves inconsistently across file formats and sometimes returns incorrect results:

{code:scala}
import org.apache.spark.sql.types._

spark.conf.set("spark.sql.caseSensitive", "false")

val formats = Seq("parquet", "orc", "avro", "json")

// The same colliding names, this time nested inside a struct column.
val caseInsensitiveSchema = new StructType()
  .add("StructColumn", new StructType()
    .add("LowerCase", LongType)
    .add("camelcase", LongType)
    .add("CamelCase", LongType))

formats.map { format =>
  val path = s"/tmp/$format"
  spark
    .range(1L)
    .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
    .write.mode("overwrite").format(format).save(path)
  spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}

The desired behavior is likely to throw the same exception as in the flat-schema scenario.
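A fix would presumably extend the existing duplicate-column check so that it recurses into struct fields. The helper below is only a minimal, self-contained sketch of that idea, not Spark's actual implementation; the function name findCaseInsensitiveDuplicates and the dotted-path reporting are assumptions for illustration.

{code:scala}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch (not Spark's real check): recursively collect field
// names that collide once lower-cased, i.e. when spark.sql.caseSensitive
// is false, reporting nested fields with a dotted path.
def findCaseInsensitiveDuplicates(schema: StructType, prefix: String = ""): Seq[String] = {
  // Names that appear more than once after lower-casing at this level.
  val localDups = schema.fields
    .groupBy(_.name.toLowerCase)
    .collect { case (lower, fields) if fields.length > 1 => prefix + lower }
    .toSeq
  // Recurse into struct-typed fields so nested collisions are found too.
  val nestedDups = schema.fields.flatMap { field =>
    field.dataType match {
      case nested: StructType =>
        findCaseInsensitiveDuplicates(nested, prefix + field.name + ".")
      case _ => Nil
    }
  }
  localDups ++ nestedDups
}
{code}

Applied to the nested schema above, this returns Seq("StructColumn.camelcase"), matching the column name quoted in the flat-schema error message, so the reader code could raise the same AnalysisException in both scenarios.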
> The .schema() API behaves incorrectly for nested schemas that have column
> duplicates in case-insensitive mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-32431
>                 URL: https://issues.apache.org/jira/browse/SPARK-32431
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Michał Świtakowski
>            Priority: Major