Teddy Crepineau created SPARK-38245:
---------------------------------------

             Summary: Avro Complex Union Type return `member$i`
                 Key: SPARK-38245
                 URL: https://issues.apache.org/jira/browse/SPARK-38245
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: +OS+
 * Debian GNU/Linux 10 (Docker Container)

+packages & others+
 * spark-avro_2.12-3.2.1
 * python 3.7.3
 * pyspark 3.2.1

 * spark-3.2.1-bin-hadoop3.2
 * Docker version 20.10.12
            Reporter: Teddy Crepineau


*Short Description*

When reading complex union types from Avro files, information is lost: the 
name of each record in the union is omitted and {{member$i}} is returned 
instead.

*Long Description*

+Error+

Given the Avro schema {{schema.avsc}}, I would expect the schema produced when 
reading the Avro file with {{read_avro.py}} to match {{expected.txt}}. 
Instead, I get the schema shown in {{reality.txt}}, where {{RecordOne}} became 
{{member0}}, {{RecordTwo}} became {{member1}}, etc.

This causes information loss and makes the DataFrame unusable.

From my understanding, this behavior was implemented 
[here|https://github.com/databricks/spark-avro/pull/117].

 
{code:java|title=read_avro.py}
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("avro").load("path/to/my/file.avro")
df.printSchema()
{code}
{code:java|title=schema.avsc}
 {
 "type": "record",
 "name": "SomeData",
 "namespace": "my.name.space",
 "fields": [
  {
   "name": "ts",
   "type": {
    "type": "long",
    "logicalType": "timestamp-millis"
   }
  },
  {
   "name": "field_id",
   "type": [
    "null",
    "string"
   ],
   "default": null
  },
  {
   "name": "values",
   "type": [
    {
     "type": "record",
     "name": "RecordOne",
     "fields": [
      {
       "name": "field_a",
       "type": "long"
      },
      {
       "name": "field_b",
       "type": {
        "type": "enum",
        "name": "FieldB",
        "symbols": [
            "..."
        ]
       }
      },
      {
       "name": "field_c",
       "type": {
        "type": "array",
        "items": "long"
       }
      }
     ]
    },
    {
     "type": "record",
     "name": "RecordTwo",
     "fields": [
      {
       "name": "field_a",
       "type": "long"
      }
     ]
    }
   ]
  }
 ]
}{code}
{code:java|title=expected.txt}
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- RecordOne: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- RecordTwo: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
{code}
{code:java|title=reality.txt}
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- member0: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- member1: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
{code}
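
The positional naming above can be illustrated without Spark. The following is a minimal sketch, assuming that Spark 3.2.1 drops the {{null}} branch of a union and then names the remaining branches {{member0}}, {{member1}}, ... in declaration order. The helper {{union_member_names}} is hypothetical (it is not part of spark-avro), and the schema is a trimmed copy of the {{values}} field from {{schema.avsc}}.

```python
import json


def union_member_names(union_branches):
    """Map positional names (member0, member1, ...) to Avro branch names.

    Hypothetical helper, not a spark-avro API. Assumes "null" branches are
    removed before the remaining branches are numbered.
    """
    non_null = [b for b in union_branches if b != "null"]
    mapping = {}
    for i, branch in enumerate(non_null):
        if isinstance(branch, dict):
            # Named types (record, enum, fixed) carry a "name"; fall back to
            # the type keyword for unnamed complex types such as arrays.
            name = branch.get("name", branch["type"])
        else:
            # Primitive types and references to named types are plain strings.
            name = branch
        mapping[f"member{i}"] = name
    return mapping


# Trimmed version of the `values` field from schema.avsc.
values_field = json.loads("""
{
  "name": "values",
  "type": [
    {"type": "record", "name": "RecordOne",
     "fields": [{"name": "field_a", "type": "long"}]},
    {"type": "record", "name": "RecordTwo",
     "fields": [{"name": "field_a", "type": "long"}]}
  ]
}
""")

mapping = union_member_names(values_field["type"])
print(mapping)
```

With such a mapping, one possible workaround is to rebuild the struct under the Avro names, for example {{df.withColumn("values", F.struct(*[F.col(f"values.{m}").alias(n) for m, n in mapping.items()]))}} with {{pyspark.sql.functions}} imported as {{F}}. This has not been verified against the file in this report.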


