Teddy Crepineau created SPARK-38245: ---------------------------------------
             Summary: Avro Complex Union Type return `member$i`
                 Key: SPARK-38245
                 URL: https://issues.apache.org/jira/browse/SPARK-38245
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: +OS+
 * Debian GNU/Linux 10 (Docker Container)

+packages & others+
 * spark-avro_2.12-3.2.1
 * python 3.7.3
 * pyspark 3.2.1
 * spark-3.2.1-bin-hadoop3.2
 * Docker version 20.10.12
            Reporter: Teddy Crepineau


*Short Description*

When reading complex union types from Avro files, some information appears to be lost: the name of each record is omitted and {{member$i}} is returned instead.

*Long Description*

+Error+

Given the Avro schema {{schema.avsc}}, I would expect reading the Avro file with {{read_avro.py}} to produce the schema shown in {{expected.txt}}. Instead, I get the schema shown in {{reality.txt}}, where {{RecordOne}} became {{member0}}, etc. This loses information and makes the DataFrame unusable.

From my understanding, this behavior was implemented [here|https://github.com/databricks/spark-avro/pull/117].

{code:java|title=read_avro.py}
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # requires the spark-avro package on the classpath
df = spark.read.format("avro").load("path/to/my/file.avro")
df.printSchema()
{code}

{code:java|title=schema.avsc}
{
  "type": "record",
  "name": "SomeData",
  "namespace": "my.name.space",
  "fields": [
    {
      "name": "ts",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    },
    {
      "name": "field_id",
      "type": [ "null", "string" ],
      "default": null
    },
    {
      "name": "values",
      "type": [
        {
          "type": "record",
          "name": "RecordOne",
          "fields": [
            { "name": "field_a", "type": "long" },
            {
              "name": "field_b",
              "type": { "type": "enum", "name": "FieldB", "symbols": [ "..." ] }
            },
            {
              "name": "field_C",
              "type": { "type": "array", "items": "long" }
            }
          ]
        },
        {
          "type": "record",
          "name": "RecordTwo",
          "fields": [
            { "name": "field_a", "type": "long" }
          ]
        }
      ]
    }
  ]
}
{code}

{code:java|title=expected.txt}
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- RecordOne: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- RecordTwo: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
{code}

{code:java|title=reality.txt}
root
 |-- ts: timestamp (nullable = true)
 |-- field_id: string (nullable = true)
 |-- values: struct (nullable = true)
 |    |-- member0: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
 |    |    |-- field_b: string (nullable = true)
 |    |    |-- field_c: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- member1: struct (nullable = true)
 |    |    |-- field_a: long (nullable = true)
{code}
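
As a possible stopgap, the record names can be recovered from the schema file itself and matched to the positional names. The snippet below is only a sketch and not part of Spark: {{union_member_names}} is a hypothetical helper, the embedded schema is a trimmed stand-in for {{schema.avsc}}, and it assumes that Spark numbers the non-null branches of a union in declaration order (consistent with the output in {{reality.txt}}). It does not handle the special cases where Spark merges a union into a single primitive type.

```python
import json

# Trimmed, hypothetical stand-in for schema.avsc (only the union fields are kept).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "SomeData",
  "fields": [
    {"name": "field_id", "type": ["null", "string"], "default": null},
    {"name": "values", "type": [
      {"type": "record", "name": "RecordOne",
       "fields": [{"name": "field_a", "type": "long"}]},
      {"type": "record", "name": "RecordTwo",
       "fields": [{"name": "field_a", "type": "long"}]}
    ]}
  ]
}
"""


def _is_null(branch):
    """True for both spellings of the Avro null type."""
    return branch == "null" or (isinstance(branch, dict) and branch.get("type") == "null")


def _branch_name(branch):
    """Named types (record, enum, fixed) use their name; others use their type."""
    if isinstance(branch, str):
        return branch
    return branch.get("name", branch["type"])


def union_member_names(avro_schema, field_name):
    """Map positional names (member0, member1, ...) to Avro branch names.

    Assumption: null branches are dropped before numbering, and a union with
    fewer than two non-null branches is not read as a struct at all.
    """
    field = next(f for f in avro_schema["fields"] if f["name"] == field_name)
    branches = field["type"]
    if not isinstance(branches, list):
        raise ValueError(f"{field_name!r} is not a union")
    non_null = [b for b in branches if not _is_null(b)]
    if len(non_null) < 2:
        return {}
    return {f"member{i}": _branch_name(b) for i, b in enumerate(non_null)}


schema = json.loads(SCHEMA_JSON)
mapping = union_member_names(schema, "values")
print(mapping)  # {'member0': 'RecordOne', 'member1': 'RecordTwo'}
```

With such a mapping, the struct could then be rebuilt on the PySpark side, for example with {{F.struct(*[F.col(f"values.{m}").alias(n) for m, n in mapping.items()])}}, where {{F}} is {{pyspark.sql.functions}}. This only renames the fields after the read and does not change how the Avro reader derives the schema.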