[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356433990
 
 

 ##
 File path: python/pyspark/sql/avro/functions.py
 ##
 @@ -30,9 +30,10 @@
 @since(3.0)
 def from_avro(data, jsonFormatSchema, options={}):
 """
-Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-arbitrary result.
+Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
 
 Review comment:
   Thanks


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356433702
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -45,9 +45,10 @@ object functions {
   }
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided in the options, a different (but compatible) schema 
can be used for reading.
 
 Review comment:
   Good point, thanks


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356249061
 
 

 ##
 File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
 ##
 @@ -153,4 +153,79 @@ class AvroFunctionsSuite extends QueryTest with 
SharedSparkSession {
   assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), 
Row("three"), Row("four")))
 }
   }
+
+  test("SPARK-27506: roundtrip in to_avro and from_avro with different 
compatible schemas") {
+val df = spark.range(10).select(
+  struct('id.as("col1"), 'id.cast("string").as("col2")).as("struct")
+)
+val avroStructDF = df.select(functions.to_avro('struct).as("avro"))
+val actualAvroSchema = s"""
+  |{
+  |  "type": "record",
+  |  "name": "struct",
+  |  "fields": [
+  |{"name": "col1", "type": "int"},
+  |{"name": "col2", "type": "string"}
+  |  ]
+  |}
 
 Review comment:
   I've updated the PR, feel free to comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356248649
 
 

 ##
 File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
 ##
 @@ -153,4 +153,79 @@ class AvroFunctionsSuite extends QueryTest with 
SharedSparkSession {
   assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), 
Row("three"), Row("four")))
 }
   }
+
+  test("SPARK-27506: roundtrip in to_avro and from_avro with different 
compatible schemas") {
+val df = spark.range(10).select(
+  struct('id.as("col1"), 'id.cast("string").as("col2")).as("struct")
+)
+val avroStructDF = df.select(functions.to_avro('struct).as("avro"))
+val actualAvroSchema = s"""
+  |{
+  |  "type": "record",
+  |  "name": "struct",
+  |  "fields": [
+  |{"name": "col1", "type": "int"},
+  |{"name": "col2", "type": "string"}
+  |  ]
+  |}
 
 Review comment:
   I think it should be as:
   ```
   val msg =
 s"""
|{
|
|
|}
|""".stripMargin
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356247604
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -51,9 +51,13 @@ case class AvroDataToCatalyst(
 
   override def nullable: Boolean = true
 
+  private lazy val avroOptions = AvroOptions(options)
+
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
+  @transient private lazy val reader = avroOptions.actualSchema
+.map(writer => new GenericDatumReader[Any](new 
Schema.Parser().parse(writer), avroSchema))
 
 Review comment:
   actualSchema* and updated :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356200877
 
 

 ##
 File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
 ##
 @@ -153,4 +153,79 @@ class AvroFunctionsSuite extends QueryTest with 
SharedSparkSession {
   assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), 
Row("three"), Row("four")))
 }
   }
+
+  test("SPARK-27506: roundtrip in to_avro and from_avro with different 
compatible schemas") {
+val df = spark.range(10).select(
+  struct('id.as("col1"), 'id.cast("string").as("col2")).as("struct")
+)
+val avroStructDF = df.select(functions.to_avro('struct).as("avro"))
+val actualAvroSchema = s"""
+  |{
+  |  "type": "record",
+  |  "name": "struct",
+  |  "fields": [
+  |{"name": "col1", "type": "int"},
+  |{"name": "col2", "type": "string"}
+  |  ]
+  |}
 
 Review comment:
   I've also looked through the codebase of Spark, but it is rather incoherent. 
Please let me know if you want to change it, and I'll update it right away.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-10 Thread GitBox
Fokko commented on a change in pull request #26780: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas 
URL: https://github.com/apache/spark/pull/26780#discussion_r356199894
 
 

 ##
 File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
 ##
 @@ -153,4 +153,79 @@ class AvroFunctionsSuite extends QueryTest with 
SharedSparkSession {
   assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), 
Row("three"), Row("four")))
 }
   }
+
+  test("SPARK-27506: roundtrip in to_avro and from_avro with different 
compatible schemas") {
+val df = spark.range(10).select(
+  struct('id.as("col1"), 'id.cast("string").as("col2")).as("struct")
+)
+val avroStructDF = df.select(functions.to_avro('struct).as("avro"))
+val actualAvroSchema = s"""
+  |{
+  |  "type": "record",
+  |  "name": "struct",
+  |  "fields": [
+  |{"name": "col1", "type": "int"},
+  |{"name": "col2", "type": "string"}
+  |  ]
+  |}
 
 Review comment:
   I've fixed this, but something looks off with Github


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org