[ https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494294#comment-17494294 ]
Erik Krogen edited comment on SPARK-38245 at 2/17/22, 11:45 PM:
----------------------------------------------------------------

This behavior is expected. The fields of the union are expanded based on their position. I guess you're proposing that the name of the type be used as the name of the field, instead of a positional name? This will get pretty confusing for unions of primitives, e.g. the following type:
{code:json}
{
  "name": "foo",
  "type": ["int", "long"]
}
{code}
would, under that proposal, have the Spark type:
{code}
root
 |-- foo: struct
 |    |-- int: int
 |    |-- long: long
{code}
Using the names of types as the names of fields looks very confusing, at least to me.

Another problem: you could end up with duplicate field names, e.g.:
{code:json}
{
  "name": "foo",
  "type": [{
      "type": "record",
      "name": "RecordOne",
      "namespace": "foo"
    }, {
      "type": "record",
      "name": "RecordOne",
      "namespace": "bar"
    }
  ]
}
{code}
Since the namespaces are different, these are different types, and this is a valid union. But in your proposal, they will result in the same field name. We could include the namespace in the field name as well, but this will get messy quickly.


was (Author: xkrogen):
This behavior is expected. The fields of the union are expanded based on their position. I guess you're proposing that the name of the type be used as the name of the field, instead of a positional name? This will get pretty confusing for unions of primitives, e.g. the following type:
{code:java}
{
  "name": "foo",
  "type": ["int", "long"]
}
{code}
would, under that proposal, have the Spark type:
{code:java}
root
 |-- foo: struct
 |    |-- int: int
 |    |-- long: long
{code}
Using the names of types as the names of fields looks very confusing, at least to me.

Another problem: you could end up with duplicate field names, e.g.:
{code:java}
{
  "name": "foo",
  "type": [{
      "type": "record",
      "name": "RecordOne",
      "namespace": "foo"
    }, {
      "type": "record",
      "name": "RecordOne",
      "namespace": "bar"
    }
  ]
}
{code}
Since the namespaces are different, these are different types, and this is a valid union. But in your proposal, they will result in the same field name. We could include the namespace in the field name as well, but this will get messy quickly.

> Avro Complex Union Type return `member$I`
> -----------------------------------------
>
>                 Key: SPARK-38245
>                 URL: https://issues.apache.org/jira/browse/SPARK-38245
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.1
>         Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>            Reporter: Teddy Crepineau
>            Priority: Major
>              Labels: avro, newbie
>
> *Short Description*
> When reading complex union types from Avro files, some information seems to
> be lost: the name of the record is omitted and {{member$i}} is returned
> instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema obtained
> when reading the Avro file using {{read_avro.py}} to match {{expected.txt}}.
> Instead, I get the schema shown in {{reality.txt}}, where {{RecordOne}}
> became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding, this behavior was implemented
> [here.|https://github.com/databricks/spark-avro/pull/117]
>
> {code:java|title=read_avro.py}
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()
> {code}
> {code:java|title=schema.avsc}
> {
>   "type": "record",
>   "name": "SomeData",
>   "namespace": "my.name.space",
>   "fields": [
>     {
>       "name": "ts",
>       "type": {
>         "type": "long",
>         "logicalType": "timestamp-millis"
>       }
>     },
>     {
>       "name": "field_id",
>       "type": [
>         "null",
>         "string"
>       ],
>       "default": null
>     },
>     {
>       "name": "values",
>       "type": [
>         {
>           "type": "record",
>           "name": "RecordOne",
>           "fields": [
>             {
>               "name": "field_a",
>               "type": "long"
>             },
>             {
>               "name": "field_b",
>               "type": {
>                 "type": "enum",
>                 "name": "FieldB",
>                 "symbols": [
>                   "..."
>                 ]
>               }
>             },
>             {
>               "name": "field_C",
>               "type": {
>                 "type": "array",
>                 "items": "long"
>               }
>             }
>           ]
>         },
>         {
>           "type": "record",
>           "name": "RecordTwo",
>           "fields": [
>             {
>               "name": "field_a",
>               "type": "long"
>             }
>           ]
>         }
>       ]
>     }
>   ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
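
The duplicate-name concern raised in the comment above can be illustrated with a small, Spark-free sketch. The helpers {{positional_names}}, {{type_based_names}} and {{duplicates}} are hypothetical names made up for this illustration; they are not Spark or Avro APIs, and the sketch deliberately ignores details such as null branches. It only shows that positional names stay unique while names derived from the record's short name can collide when two records differ only by namespace.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not part of Spark or Avro.


def positional_names(union_branches):
    """Name each union branch by its position: member0, member1, ..."""
    return [f"member{i}" for i, _ in enumerate(union_branches)]


def type_based_names(union_branches):
    """Name each branch after its type (the proposal discussed above)."""
    names = []
    for branch in union_branches:
        if isinstance(branch, str):
            # Primitive branch such as "int" or "long".
            names.append(branch)
        else:
            # Named type: use the short name; the namespace is ignored.
            names.append(branch["name"])
    return names


def duplicates(names):
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# The union from the comment: same record name, different namespaces.
union = [
    {"type": "record", "name": "RecordOne", "namespace": "foo"},
    {"type": "record", "name": "RecordOne", "namespace": "bar"},
]

print(positional_names(union))              # unique by construction
print(type_based_names(union))              # both branches get the same name
print(duplicates(type_based_names(union)))  # the colliding name(s)
```

Under these assumptions, the positional scheme cannot produce duplicates, whereas the type-based scheme would need some extra disambiguation (for example the namespace) to stay unique.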