[ 
https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38245:
---------------------------------
    Labels: avro  (was: avro newbie)

> Avro Complex Union Type return `member$I`
> -----------------------------------------
>
>                 Key: SPARK-38245
>                 URL: https://issues.apache.org/jira/browse/SPARK-38245
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>         Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>            Reporter: Teddy Crepineau
>            Priority: Major
>              Labels: avro
>
> *Short Description*
> When reading complex union types from Avro files, information is lost: the 
> name of each record in the union is omitted and {{member$i}} is returned 
> instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema when 
> reading the Avro file using {{read_avro.py}} to match {{expected.txt}}. 
> Instead, I get the schema shown in {{reality.txt}}, where {{RecordOne}} 
> became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding this behavior was implemented 
> [here.|https://github.com/databricks/spark-avro/pull/117]
>  
> {code:java|title=read_avro.py}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()
>  {code}
> {code:java|title=schema.avsc}
>  {
>  "type": "record",
>  "name": "SomeData",
>  "namespace": "my.name.space",
>  "fields": [
>   {
>    "name": "ts",
>    "type": {
>     "type": "long",
>     "logicalType": "timestamp-millis"
>    }
>   },
>   {
>    "name": "field_id",
>    "type": [
>     "null",
>     "string"
>    ],
>    "default": null
>   },
>   {
>    "name": "values",
>    "type": [
>     {
>      "type": "record",
>      "name": "RecordOne",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       },
>       {
>        "name": "field_b",
>        "type": {
>         "type": "enum",
>         "name": "FieldB",
>         "symbols": [
>             "..."
>         ]
>        }
>       },
>       {
>        "name": "field_c",
>        "type": {
>         "type": "array",
>         "items": "long"
>        }
>       }
>      ]
>     },
>     {
>      "type": "record",
>      "name": "RecordTwo",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       }
>      ]
>     }
>    ]
>   }
>  ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
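
A possible workaround, sketched here under stated assumptions rather than taken from the issue: since the record names are still present in the .avsc file, the memberN-to-name mapping can be rebuilt from the schema itself. The helper name union_member_names and the trimmed SCHEMA below are hypothetical. The sketch assumes that "null" branches are dropped before the remaining branches are numbered, and it does not handle the [int, long] and [float, double] unions, which Spark widens to a single column instead of a struct.

```python
import json

# Trimmed, hypothetical version of schema.avsc (only the parts needed here).
SCHEMA = json.loads("""
{
  "type": "record", "name": "SomeData", "namespace": "my.name.space",
  "fields": [
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "field_id", "type": ["null", "string"], "default": null},
    {"name": "values", "type": [
      {"type": "record", "name": "RecordOne",
       "fields": [{"name": "field_a", "type": "long"}]},
      {"type": "record", "name": "RecordTwo",
       "fields": [{"name": "field_a", "type": "long"}]}
    ]}
  ]
}
""")


def _is_null(branch):
    """True for the Avro null type, written as "null" or {"type": "null"}."""
    return branch == "null" or (isinstance(branch, dict) and branch.get("type") == "null")


def _branch_label(branch):
    """Named types (record, enum, fixed) use their name; others use their type."""
    if isinstance(branch, str):
        return branch
    return branch.get("name") or str(branch["type"])


def union_member_names(avro_schema):
    """Map each top-level complex-union field to {"memberN": branch label}."""
    mapping = {}
    for field in avro_schema.get("fields", []):
        field_type = field["type"]
        if not isinstance(field_type, list):
            continue  # not a union
        # Assumption: null branches are removed before members are numbered.
        branches = [b for b in field_type if not _is_null(b)]
        if len(branches) < 2:
            continue  # e.g. ["null", "string"] is read as a plain nullable column
        mapping[field["name"]] = {
            f"member{i}": _branch_label(b) for i, b in enumerate(branches)
        }
    return mapping


names = union_member_names(SCHEMA)
print(names)
```

With such a mapping, the struct could then be rebuilt in PySpark by selecting each values.memberN column, aliasing it to the mapped name, and wrapping the results in pyspark.sql.functions.struct. This only inspects top-level fields; nested unions would need a recursive walk.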



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
