[ 
https://issues.apache.org/jira/browse/SPARK-51961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51961:
-----------------------------------
    Labels: pull-request-available  (was: )

> from_avro fails after to_avro working with the same jsonFormatSchema
> --------------------------------------------------------------------
>
>                 Key: SPARK-51961
>                 URL: https://issues.apache.org/jira/browse/SPARK-51961
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Rodrigo Cardoso
>            Priority: Major
>              Labels: pull-request-available
>
> I was able to reproduce the behavior with the following minimal snippet:
> {code:java}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.avro.functions import to_avro, from_avro
> # Initialize Spark session
> spark = (
>     SparkSession.builder.appName("AvroSerializationExample")
>     .config(
>         "spark.jars.packages",
>         "org.apache.spark:spark-avro_2.12:3.1.2"
>     )
>     .getOrCreate()
> )
> jsonFormatSchema = """[{"type":"record","name":"value","fields":[{"name":"age","type":["long","null"]},{"name":"name","type":["string","null"]}]}, "null"]"""
> # -> serialize the struct column to Avro binary
> data = [(Row(age=2, name="Alice"),)]
> df = spark.createDataFrame(data, ["value"])
> df.printSchema()
> avro_df = df.select(to_avro(df.value, jsonFormatSchema).alias("value"))
> avro_df.show()
> # <- deserialize the Avro binary back with the same schema
> df = avro_df.select(from_avro(avro_df.value, jsonFormatSchema).alias("value"))
> df.show(){code}
> This code raises the following exception:
> {code:java}
> Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Attempting to treat union as a RECORD, but it was: UNION
>     at org.apache.spark.sql.avro.AvroUtils$.getAvroFieldByName(AvroUtils.scala:223)
>     at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:333)
>     at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:76)
>     at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:56)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer$lzycompute(AvroDataToCatalyst.scala:64)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer(AvroDataToCatalyst.scala:64)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
>     ... 17 more{code}
> From the user's perspective, it is strange that Spark can apply to_avro to a
> column with a given schema, yet from_avro fails when given the same schema.
> The problem seems to be specific to structs: I tested a similar schema from
> the pyspark.sql.avro.functions.to_avro samples (a nullable enum in that
> case) and it works as expected. The samples for
> pyspark.sql.avro.functions.from_avro are very similar to what I'm trying
> here, but they work because they don't pass jsonFormatSchema to to_avro.
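
The exception message says a union was found where a record was expected. As a rough illustration of what that refers to, the sketch below parses the reported jsonFormatSchema with only the Python standard library and shows that its root is a JSON array, which Avro interprets as a union, with the record as one of its branches. The helper `top_level_kind` is hypothetical and not part of Spark or Avro. The sketch only inspects the schema text. It does not explain why to_avro accepts the schema, and it is not a proposed fix.

```python
import json

# Schema string copied from the report (joined onto one line).
json_format_schema = (
    '[{"type":"record","name":"value","fields":['
    '{"name":"age","type":["long","null"]},'
    '{"name":"name","type":["string","null"]}]}, "null"]'
)


def top_level_kind(schema_json: str) -> str:
    """Hypothetical helper: classify the root of an Avro schema given as JSON text."""
    parsed = json.loads(schema_json)
    if isinstance(parsed, list):
        # In Avro's JSON encoding, an array at the root denotes a union.
        return "union"
    if isinstance(parsed, dict):
        # Complex types carry their kind in the "type" attribute.
        return str(parsed.get("type", "unknown"))
    # A bare JSON string is a primitive or named type reference.
    return str(parsed)


# List the kind of each branch of the root union.
branches = json.loads(json_format_schema)
branch_kinds = [b["type"] if isinstance(b, dict) else b for b in branches]

print(top_level_kind(json_format_schema))  # union
print(branch_kinds)                        # ['record', 'null']
```

The root of the schema is therefore a union of a record and null, which matches the wording of the IncompatibleSchemaException above.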



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
