[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip Adetiloye updated SPARK-43201: ------------------------------------- Description: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // +------------+ // | parsedData| // +------------+ // |[apple1, 1.0]| // |[apple2, 2.0]| // +------------+ {code} was: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show() // Output: // +------------+ // | parsedData| // +------------+ // |[apple1, 1.0]| // |[apple2, 2.0]| // +------------+ {code} > Inconsistency between from_avro and from_json function > ------------------------------------------------------ > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.4.0 > Reporter: Philip Adetiloye > Priority: Major > > Spark from_avro function does not allow schema to use dataframe column but > takes a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schema since only one schema string could be pass externally. > > Here is what I would expect: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) > parsed.show() > // Output: > // +------------+ > // | parsedData| > // +------------+ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // +------------+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org