[ https://issues.apache.org/jira/browse/SPARK-51961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rodrigo Cardoso updated SPARK-51961:
------------------------------------
    Description:

I was able to reproduce the behavior described in the summary with the following snippet:

{code:java}
from pyspark.sql import Row, SparkSession
from pyspark.sql.avro.functions import to_avro, from_avro

# Initialize Spark session
spark = (
    SparkSession.builder.appName("AvroSerializationExample")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-avro_2.12:3.1.2",
    )
    .getOrCreate()
)

jsonFormatSchema = """[{"type":"record","name":"value","fields":[
    {"name":"age","type":["long","null"]},
    {"name":"name","type":["string","null"]}]}, "null"]"""

# ->
data = [(Row(age=2, name="Alice"),)]
df = spark.createDataFrame(data, ["value"])
df.printSchema()

avro_df = df.select(to_avro(df.value, jsonFormatSchema).alias("value"))
avro_df.show()
# <-

df = avro_df.select(from_avro(avro_df.value, jsonFormatSchema).alias("value"))
df.show()
{code}

This code raises the following exception:

{code:java}
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Attempting to treat union as a RECORD, but it was: UNION
  at org.apache.spark.sql.avro.AvroUtils$.getAvroFieldByName(AvroUtils.scala:223)
  at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:333)
  at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:76)
  at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:56)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer$lzycompute(AvroDataToCatalyst.scala:64)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.deserializer(AvroDataToCatalyst.scala:64)
  at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
  ... 17 more
{code}

From the user's perspective it is surprising that Spark can serialize a column with to_avro using a given schema, yet from_avro fails when the same schema is passed back to it. It seems to be related to structs: I have tested similar schemas from the pyspark.sql.avro.functions.to_avro examples (a nullable enum, in that case) and they work as expected. The examples in pyspark.sql.avro.functions.from_avro are very close to what I am trying here, but they work because jsonFormatSchema is not passed to to_avro there.
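For comparison, here is a minimal sketch of the same round trip with the top-level {{["record", "null"]}} union removed, i.e. passing only the bare record schema. Given the error message, my working hypothesis is that the top-level UNION is what trips the deserializer, so this variant is expected to succeed; the snippet is illustrative only (names such as {{record_only_schema}} and the app name are mine) and has not been verified in the exact environment above:

{code:java}
# Sketch only (hypothesis, not verified here): same round trip, but with the
# bare record schema instead of a top-level ["record", "null"] union.
# Assumes the spark-avro package is on the classpath, as in the snippet above.
from pyspark.sql import Row, SparkSession
from pyspark.sql.avro.functions import from_avro, to_avro

spark = SparkSession.builder.appName("AvroRoundTripSketch").getOrCreate()

record_only_schema = """{"type":"record","name":"value","fields":[
    {"name":"age","type":["long","null"]},
    {"name":"name","type":["string","null"]}]}"""

df = spark.createDataFrame([(Row(age=2, name="Alice"),)], ["value"])
avro_df = df.select(to_avro(df.value, record_only_schema).alias("value"))

# Hypothesis: from_avro succeeds here because the writer schema is a plain
# RECORD rather than a UNION at the top level.
avro_df.select(from_avro(avro_df.value, record_only_schema).alias("value")).show()
{code}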
> from_avro fails after to_avro working with the same jsonFormatSchema
> --------------------------------------------------------------------
>
>                 Key: SPARK-51961
>                 URL: https://issues.apache.org/jira/browse/SPARK-51961
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Rodrigo Cardoso
>            Priority: Major