[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-12-05 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r354183693
 
 

 ##
 File path: docs/sql-data-sources-avro.md
 ##
 @@ -240,6 +240,14 @@ Data source options of Avro can be set via:
 
 function from_avro
   
+  
+writerSchema
 
 Review comment:
   Yeah, actually `writerSchema` name is super confusing to me. Can we use 
`actualSchema`? I would prefer this one more.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-22 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r316957141
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,7 +54,9 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
+  @transient private lazy val reader = writerJsonFormatSchema
 
 Review comment:
   I am saying it doesn't look clear unless you read the whole PR description 
and JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-22 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r316956886
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty))
+  jsonFormatSchema: String,
+  writerJsonFormatSchema: Option[String]): Column = {
 
 Review comment:
   Is it a Java native library? or Scala's?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r313480798
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -33,7 +33,8 @@ import org.apache.spark.sql.types._
 case class AvroDataToCatalyst(
 child: Expression,
 jsonFormatSchema: String,
-options: Map[String, String])
+options: Map[String, String],
+writerJsonFormatSchema: Option[String] = None)
 
 Review comment:
   Can we rename it with proper docs? Two avro schemas look super confusing


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r313480598
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,7 +54,9 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
+  @transient private lazy val reader = writerJsonFormatSchema
 
 Review comment:
   It's `writerJsonFormatSchema` but why we need to set in reader?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r313477197
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
 
 Review comment:
   Can you describe some copy-and-paste-able example? you can do it via:
   
   ```
   {{{
 ...
   }}}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r313477318
 
 

 ##
 File path: python/pyspark/sql/avro/functions.py
 ##
 @@ -57,9 +59,13 @@ def from_avro(data, jsonFormatSchema, options={}):
 """
 
 Review comment:
   Can we add a doctest?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-13 Thread GitBox
HyukjinKwon commented on a change in pull request #24405: [SPARK-27506][SQL] 
Allow deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#discussion_r313471760
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty))
+  jsonFormatSchema: String,
+  writerJsonFormatSchema: Option[String]): Column = {
 
 Review comment:
   `Option` is Scala specific one. I think we need another overridden version 
for Java to use it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org