[ https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Krogen updated SPARK-34133:
--------------------------------
Description:
Spark SQL is case-insensitive by default, but currently, when {{AvroSerializer}} and {{AvroDeserializer}} match Catalyst schemas against Avro schemas, the matching is always done case-sensitively. So, for example, the following will fail:
{code}
val avroSchema =
  """
    |{
    |  "type" : "record",
    |  "name" : "test_schema",
    |  "fields" : [
    |    {"name": "foo", "type": "int"},
    |    {"name": "BAR", "type": "int"}
    |  ]
    |}
  """.stripMargin
val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
{code}
The same is true on the read path: if we assume {{testAvro}} has been written using the schema above, the following will fail to match the fields:
{code}
spark.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro").load(testAvro)
{code}

was:
Spark SQL is case-insensitive by default, but currently, when {{AvroSerializer}} and {{AvroDeserializer}} match Catalyst schemas against Avro schemas, the matching is always done case-sensitively. So, for example, the following will fail:
{code}
val avroSchema =
  """
    |{
    |  "type" : "record",
    |  "name" : "test_schema",
    |  "fields" : [
    |    {"name": "foo", "type": "int"},
    |    {"name": "BAR", "type": "int"}
    |  ]
    |}
  """.stripMargin
val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
{code}
The same is true on the read path: if we assume {{testAvro}} has been written using the schema above, the following will fail to match the fields:
{code}
spark.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro").load(testAvro)
{code}
In addition, the error messages in this type of failure scenario are short on information on the write path ({{AvroSerializer}}); we can make them much more helpful for users debugging schema issues.
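One possible direction, shown as a minimal Spark-free sketch (the {{FieldMatcher}} object and its API are hypothetical illustrations, not Spark's actual internals): when case sensitivity is disabled, field lookup can compare names case-insensitively, while still failing loudly if two Avro fields differ only by case and the match is therefore ambiguous:

```scala
// Hypothetical sketch of case-aware Catalyst-to-Avro field matching.
object FieldMatcher {
  /** Find the Avro field name matching a Catalyst field name.
    * When caseSensitive is false, names are compared case-insensitively;
    * an ambiguous match (two Avro fields differing only by case) is an error. */
  def findMatch(
      catalystName: String,
      avroFieldNames: Seq[String],
      caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      avroFieldNames.find(_ == catalystName)
    } else {
      avroFieldNames.filter(_.equalsIgnoreCase(catalystName)) match {
        case Seq(single) => Some(single)
        case Seq()       => None
        case multiple    =>
          throw new IllegalArgumentException(
            s"Ambiguous Avro field match for '$catalystName': ${multiple.mkString(", ")}")
      }
    }
  }
}
```

With this behavior, a Catalyst field {{FOO}} would resolve to the Avro field {{foo}} under case-insensitive matching, so the write example above would succeed, while case-sensitive mode would preserve today's strict matching.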
> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-34133
>                 URL: https://issues.apache.org/jira/browse/SPARK-34133
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Erik Krogen
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)