[ https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-38245:
---------------------------------
    Labels: avro  (was: avro newbie)

> Avro Complex Union Type return `member$I`
> -----------------------------------------
>
>                 Key: SPARK-38245
>                 URL: https://issues.apache.org/jira/browse/SPARK-38245
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>         Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>            Reporter: Teddy Crepineau
>            Priority: Major
>              Labels: avro
>
> *Short Description*
> When reading complex union types from Avro files, some information is lost:
> the name of each record is omitted and {{member$i}} is returned instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema obtained when
> reading the Avro file with {{read_avro.py}} to match {{expected.txt}}.
> Instead, I get the schema shown in {{reality.txt}}, where {{RecordOne}}
> became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding, this behavior was implemented
> [here.|https://github.com/databricks/spark-avro/pull/117]
>
> {code:java|title=read_avro.py}
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()
> {code}
> {code:java|title=schema.avsc}
> {
>   "type": "record",
>   "name": "SomeData",
>   "namespace": "my.name.space",
>   "fields": [
>     {
>       "name": "ts",
>       "type": {
>         "type": "long",
>         "logicalType": "timestamp-millis"
>       }
>     },
>     {
>       "name": "field_id",
>       "type": [
>         "null",
>         "string"
>       ],
>       "default": null
>     },
>     {
>       "name": "values",
>       "type": [
>         {
>           "type": "record",
>           "name": "RecordOne",
>           "fields": [
>             {
>               "name": "field_a",
>               "type": "long"
>             },
>             {
>               "name": "field_b",
>               "type": {
>                 "type": "enum",
>                 "name": "FieldB",
>                 "symbols": [
>                   "..."
>                 ]
>               }
>             },
>             {
>               "name": "field_C",
>               "type": {
>                 "type": "array",
>                 "items": "long"
>               }
>             }
>           ]
>         },
>         {
>           "type": "record",
>           "name": "RecordTwo",
>           "fields": [
>             {
>               "name": "field_a",
>               "type": "long"
>             }
>           ]
>         }
>       ]
>     }
>   ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
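
To make the naming difference between the expected and actual schemas concrete, here is a minimal pure-Python sketch (no Spark required) that derives which positional name (member0, member1, ...) corresponds to which Avro record name in the `values` union from the report. The helpers `branch_name` and `member_to_record_names` are hypothetical names introduced for illustration, the schema is a trimmed copy of the one quoted above, and dropping the "null" branch before numbering is an assumption about how positions are assigned. This is not Spark's implementation.

```python
import json

# Trimmed copy of the `values` union from schema.avsc in the report;
# only the parts needed to derive branch names are kept.
VALUES_UNION = json.loads("""
[
  {"type": "record", "name": "RecordOne",
   "fields": [{"name": "field_a", "type": "long"}]},
  {"type": "record", "name": "RecordTwo",
   "fields": [{"name": "field_a", "type": "long"}]}
]
""")


def branch_name(branch):
    """Return the Avro name of one union branch (hypothetical helper)."""
    # Primitive branches are plain strings such as "null" or "string".
    if isinstance(branch, str):
        return branch
    # Named types (record, enum, fixed) carry a "name";
    # unnamed complex types (array, map) fall back to their "type".
    return branch.get("name", branch["type"])


def member_to_record_names(union):
    """Map positional names (member0, member1, ...) to Avro branch names."""
    # Assumption: the "null" branch is dropped before positions are assigned.
    non_null = [b for b in union if b != "null"]
    return {f"member{i}": branch_name(b) for i, b in enumerate(non_null)}


mapping = member_to_record_names(VALUES_UNION)
print(mapping)  # {'member0': 'RecordOne', 'member1': 'RecordTwo'}
```

A mapping like this could serve as the basis for renaming the struct fields after loading, as long as the `.avsc` file is available next to the data.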