[ https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Krogen updated SPARK-34133:
--------------------------------
Description:
Spark SQL is case-insensitive by default, but currently, when {{AvroSerializer}} and {{AvroDeserializer}} match Catalyst schemas against Avro schemas, the matching is always done case-sensitively. So, for example, the following will fail:
{code}
val avroSchema =
  """
    |{
    |  "type" : "record",
    |  "name" : "test_schema",
    |  "fields" : [
    |    {"name": "foo", "type": "int"},
    |    {"name": "BAR", "type": "int"}
    |  ]
    |}
  """.stripMargin
val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
{code}
The same is true on the read path: if we assume {{testAvro}} has been written using the schema above, the following will fail to match the fields:
{code}
spark.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro").load(testAvro)
{code}

was:
Spark SQL is case-insensitive by default, but currently, when {{AvroSerializer}} and {{AvroDeserializer}} match Catalyst schemas against Avro schemas, the matching is always done case-sensitively. So, for example, the following will fail:
{code}
val avroSchema =
  """
    |{
    |  "type" : "record",
    |  "name" : "test_schema",
    |  "fields" : [
    |    {"name": "foo", "type": "int"},
    |    {"name": "BAR", "type": "int"}
    |  ]
    |}
  """.stripMargin
val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
{code}
The same is true on the read path: if we assume {{testAvro}} has been written using the schema above, the following will fail to match the fields:
{code}
spark.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro").load(testAvro)
{code}
In addition, the error messages in this type of failure scenario are short on information on the write path ({{AvroSerializer}}); we can make them much more helpful for users debugging schema issues.
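One possible direction, shown as a minimal Spark-free sketch (the {{FieldMatcher}} object and its API are hypothetical illustrations, not Spark's actual internals): when case sensitivity is disabled, field lookup can compare names case-insensitively, while still failing loudly if two Avro fields differ only by case and the match is therefore ambiguous:

```scala
// Hypothetical sketch of case-aware Catalyst-to-Avro field matching.
object FieldMatcher {
  /** Find the Avro field name matching a Catalyst field name.
    * When caseSensitive is false, names are compared case-insensitively;
    * an ambiguous match (two Avro fields differing only by case) is an error. */
  def findMatch(
      catalystName: String,
      avroFieldNames: Seq[String],
      caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      avroFieldNames.find(_ == catalystName)
    } else {
      avroFieldNames.filter(_.equalsIgnoreCase(catalystName)) match {
        case Seq(single) => Some(single)
        case Seq()       => None
        case multiple    =>
          throw new IllegalArgumentException(
            s"Ambiguous Avro field match for '$catalystName': ${multiple.mkString(", ")}")
      }
    }
  }
}
```

With this behavior, a Catalyst field {{FOO}} would resolve to the Avro field {{foo}} under case-insensitive matching, so the write example above would succeed, while case-sensitive mode would preserve today's strict matching.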
> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-34133
>                 URL: https://issues.apache.org/jira/browse/SPARK-34133
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Erik Krogen
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)