[ https://issues.apache.org/jira/browse/SPARK-51961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rodrigo Cardoso updated SPARK-51961:
------------------------------------
    Description:

I was able to reproduce the behavior described in the summary with the following snippet:

{code:java}
from pyspark.sql import Row, SparkSession
from pyspark.sql.avro.functions import to_avro, from_avro

# Initialize Spark session
spark = (
    SparkSession.builder.appName("AvroSerializationExample")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-avro_2.12:3.1.2",
    )
    .getOrCreate()
)

jsonFormatSchema = """[{"type":"record","name":"value","fields":[
    {"name":"age","type":["long","null"]},
    {"name":"name","type":["string","null"]}]}, "null"]"""

# ->
data = [(Row(age=2, name="Alice"),)]
df = spark.createDataFrame(data, ["value"])
df.printSchema()

avro_df = df.select(to_avro(df.value, jsonFormatSchema).alias("value"))
avro_df.show()
# <-

df = avro_df.select(from_avro(avro_df.value, jsonFormatSchema).alias("value"))
df.show()
{code}

This code raises the following exception:

{code:java}
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Attempting to treat union as a RECORD, but it was: UNION
  at org.apache.spark.sql.avro.AvroUtils$.getAvroFieldByName(AvroUtils.scala:223)
  at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:333)
  at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:76)
  at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:56)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer$lzycompute(AvroDataToCatalyst.scala:64)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer(AvroDataToCatalyst.scala:64)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
  ... 17 more
{code}

From the user's perspective it is surprising that Spark can serialize a column with to_avro using a given schema, yet from_avro fails when the same schema is passed back to it. It seems to be related to structs: I have tested similar schemas from the pyspark.sql.avro.functions.to_avro examples (a nullable enum, in that case) and they work as expected. The examples in pyspark.sql.avro.functions.from_avro are very close to what I am trying here, but they work because jsonFormatSchema is not passed to to_avro there.
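For comparison, here is a minimal sketch of the same round trip with the top-level {{["record", "null"]}} union removed, i.e. passing only the bare record schema. Given the error message, my working hypothesis is that the top-level UNION is what trips the deserializer, so this variant is expected to succeed; the snippet is illustrative only (names such as {{record_only_schema}} and the app name are mine) and has not been verified in the exact environment above:

{code:java}
# Sketch only (hypothesis, not verified here): same round trip, but with the
# bare record schema instead of a top-level ["record", "null"] union.
# Assumes the spark-avro package is on the classpath, as in the snippet above.
from pyspark.sql import Row, SparkSession
from pyspark.sql.avro.functions import from_avro, to_avro

spark = SparkSession.builder.appName("AvroRoundTripSketch").getOrCreate()

record_only_schema = """{"type":"record","name":"value","fields":[
    {"name":"age","type":["long","null"]},
    {"name":"name","type":["string","null"]}]}"""

df = spark.createDataFrame([(Row(age=2, name="Alice"),)], ["value"])
avro_df = df.select(to_avro(df.value, record_only_schema).alias("value"))

# Hypothesis: from_avro succeeds here because the writer schema is a plain
# RECORD rather than a UNION at the top level.
avro_df.select(from_avro(avro_df.value, record_only_schema).alias("value")).show()
{code}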
> from_avro fails after to_avro working with the same jsonFormatSchema
> --------------------------------------------------------------------
>
>                 Key: SPARK-51961
>                 URL: https://issues.apache.org/jira/browse/SPARK-51961
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Rodrigo Cardoso
>            Priority: Major