[ https://issues.apache.org/jira/browse/SPARK-51961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-51961:
-----------------------------------
Labels: pull-request-available (was: )
> from_avro fails after to_avro working with the same jsonFormatSchema
> --------------------------------------------------------------------
>
> Key: SPARK-51961
> URL: https://issues.apache.org/jira/browse/SPARK-51961
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: Rodrigo Cardoso
> Priority: Major
> Labels: pull-request-available
>
> I was able to reproduce the behavior with the following minimal snippet:
> {code:python}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.avro.functions import to_avro, from_avro
>
> # Initialize Spark session
> spark = (
>     SparkSession.builder.appName("AvroSerializationExample")
>     .config(
>         "spark.jars.packages",
>         "org.apache.spark:spark-avro_2.12:3.1.2",
>     )
>     .getOrCreate()
> )
>
> # Top-level union of a record and null
> jsonFormatSchema = """[{"type":"record","name":"value","fields":[
> {"name":"age","type":["long","null"]},
> {"name":"name","type":["string","null"]}]},
> "null"]"""
>
> # -> serialize the struct column with to_avro
> data = [(Row(age=2, name="Alice"),)]
> df = spark.createDataFrame(data, ["value"])
> df.printSchema()
> avro_df = df.select(to_avro(df.value, jsonFormatSchema).alias("value"))
> avro_df.show()
>
> # <- deserialize with from_avro using the same schema
> df = avro_df.select(from_avro(avro_df.value, jsonFormatSchema).alias("value"))
> df.show()
> {code}
> This code raises the following exception:
> {code:java}
> Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Attempting to treat union as a RECORD, but it was: UNION
>     at org.apache.spark.sql.avro.AvroUtils$.getAvroFieldByName(AvroUtils.scala:223)
>     at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:333)
>     at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:76)
>     at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:56)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer$lzycompute(AvroDataToCatalyst.scala:64)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer(AvroDataToCatalyst.scala:64)
>     at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
>     ... 17 more
> {code}
> From the user's perspective, it is strange that Spark can apply to_avro to a
> column with a given schema, yet from_avro fails when given the same schema.
> The problem seems to be related to structs: I tested similar schemas from the
> pyspark.sql.avro.functions.to_avro samples (a nullable enum in this case) and
> they work as expected. The samples for pyspark.sql.avro.functions.from_avro
> are very similar to what I'm trying here, but they work because they do not
> pass jsonFormatSchema to to_avro.
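
The exception message reports that a RECORD was expected where a UNION was found. As a minimal sketch, using only Python's standard json module and no Spark or Avro calls, the snippet below checks the outermost type of the schema string from the report. The helper name top_level_kind is hypothetical and introduced only for illustration. It confirms the shape of the schema and does not reproduce or explain Spark's internal behavior.

```python
import json

# Schema string copied from the report (line breaks adjusted, content unchanged).
json_format_schema = """[{"type":"record","name":"value","fields":[
{"name":"age","type":["long","null"]},
{"name":"name","type":["string","null"]}]},
"null"]"""


def top_level_kind(schema_json: str) -> str:
    """Classify the outermost Avro type of a JSON schema string (hypothetical helper)."""
    parsed = json.loads(schema_json)
    if isinstance(parsed, list):
        return "union"  # in Avro's JSON encoding, an array denotes a union
    if isinstance(parsed, dict):
        return parsed["type"]  # complex types carry an explicit "type" attribute
    return parsed  # a bare string names a primitive or a previously defined type


branches = json.loads(json_format_schema)
print(top_level_kind(json_format_schema))  # union
print([top_level_kind(json.dumps(b)) for b in branches])  # ['record', 'null']
```

The record sits one level down, as the first branch of a top-level union, which matches the wording of the IncompatibleSchemaException above.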