[ 
https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494691#comment-17494691
 ] 

Erik Krogen commented on SPARK-38245:
-------------------------------------

{quote}
now I am curious to know why including the namespace would "get messy quickly"
{quote}
Namespaces are, in my experience, often fairly long. Say we have records 
like {{org.apache.spark.avrotest.RecordOne}}: that is a very long and unwieldy 
field name. Plus, I don't think we allow periods in field names, so we would 
probably replace them with underscores. Personally, I would rather refer to a 
field like {{member1}} than one like {{org_apache_spark_avrotest_RecordOne}}, 
though I suppose there are valid arguments for the longer one.
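
To make the comparison concrete, here is a minimal sketch of the two naming options. The helper names {{positional_name}} and {{namespaced_name}} are hypothetical, and replacing periods with underscores is only the substitution suggested above, not existing Spark behavior.

```python
def positional_name(index: int) -> str:
    """Field name in the current positional style, e.g. member1."""
    return f"member{index}"


def namespaced_name(avro_full_name: str) -> str:
    """Hypothetical alternative: the Avro full name with periods
    replaced by underscores (an assumption, not a Spark API)."""
    return avro_full_name.replace(".", "_")


print(positional_name(1))
print(namespaced_name("org.apache.spark.avrotest.RecordOne"))
```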

{quote}
I am wondering if other solutions could help make this decoding of avro files 
to DataFrame loss-less (e.g. keeping a mapping of positional name to field 
names, etc.)
{quote}
I don't really see how the conversion is currently lossy. Taking the 
example I shared above with an int/long union, the current schema will be:
{code}
root
 |-- foo: struct
 |    |-- member0: int
 |    |-- member1: long
{code}
If you want to recover the original union branches, you can just check the 
types in the schema. The mapping you described is already maintained 
implicitly, since each field name maps to its type.
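
As a rough illustration of checking the types, the sketch below reads the member-to-type mapping out of a struct in the dict form that PySpark's {{StructType.jsonValue()}} returns. The function name {{union_branch_types}} is hypothetical, and {{foo_struct}} is a hand-written stand-in for the schema shown above rather than output captured from a real DataFrame.

```python
def union_branch_types(struct_json: dict) -> dict:
    """Map each member field name to its type, given a struct in the
    dict form returned by StructType.jsonValue()."""
    return {field["name"]: field["type"] for field in struct_json["fields"]}


# Hand-written stand-in for df.schema["foo"].dataType.jsonValue()
foo_struct = {
    "type": "struct",
    "fields": [
        {"name": "member0", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "member1", "type": "long", "nullable": True, "metadata": {}},
    ],
}

print(union_branch_types(foo_struct))
```

For record branches, the value under {{"type"}} would itself be a nested struct dict rather than a string.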

> Avro Complex Union Type return `member$I`
> -----------------------------------------
>
>                 Key: SPARK-38245
>                 URL: https://issues.apache.org/jira/browse/SPARK-38245
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>         Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>            Reporter: Teddy Crepineau
>            Priority: Major
>              Labels: avro, newbie
>
> *Short Description*
> When reading complex union types from Avro files, some information appears 
> to be lost: the name of the record is omitted and {{member$i}} is returned 
> instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema when 
> reading the Avro file using {{read_avro.py}} to be as in {{expected.txt}}. 
> Instead, I get the schema output in {{reality.txt}}, where {{RecordOne}} 
> became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding this behavior was implemented 
> [here.|https://github.com/databricks/spark-avro/pull/117]
>  
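> {code:python|title=read_avro.py}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()  # assumes spark-avro is on the classpath
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()
> {code}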
> {code:java|title=schema.avsc}
>  {
>  "type": "record",
>  "name": "SomeData",
>  "namespace": "my.name.space",
>  "fields": [
>   {
>    "name": "ts",
>    "type": {
>     "type": "long",
>     "logicalType": "timestamp-millis"
>    }
>   },
>   {
>    "name": "field_id",
>    "type": [
>     "null",
>     "string"
>    ],
>    "default": null
>   },
>   {
>    "name": "values",
>    "type": [
>     {
>      "type": "record",
>      "name": "RecordOne",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       },
>       {
>        "name": "field_b",
>        "type": {
>         "type": "enum",
>         "name": "FieldB",
>         "symbols": [
>             "..."
>         ]
>        }
>       },
>       {
>        "name": "field_c",
>        "type": {
>         "type": "array",
>         "items": "long"
>        }
>       }
>      ]
>     },
>     {
>      "type": "record",
>      "name": "RecordTwo",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       }
>      ]
>     }
>    ]
>   }
>  ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
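
The difference between {{expected.txt}} and {{reality.txt}} comes down to a positional renaming of the union branches. The sketch below pairs each positional name with the Avro branch name it replaces. The function {{member_to_branch_name}} is hypothetical, the record fields are elided for brevity, and the code assumes that {{"null"}} branches are dropped before the remaining branches are numbered.

```python
def member_to_branch_name(union_branches: list) -> dict:
    """Pair positional names (member0, member1, ...) with Avro branch names.

    Assumption: "null" branches are dropped before numbering.
    """
    non_null = [branch for branch in union_branches if branch != "null"]
    names = {}
    for index, branch in enumerate(non_null):
        if isinstance(branch, dict):
            # Named types (record, enum, fixed) carry a "name";
            # fall back to the type for unnamed complex types.
            label = branch.get("name", branch["type"])
        else:
            # Primitive branches are plain strings such as "long".
            label = branch
        names[f"member{index}"] = label
    return names


# The "values" union from schema.avsc, with record fields elided.
values_union = [
    {"type": "record", "name": "RecordOne", "fields": []},
    {"type": "record", "name": "RecordTwo", "fields": []},
]

print(member_to_branch_name(values_union))
```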


