HyukjinKwon commented on a change in pull request #30216:
URL: https://github.com/apache/spark/pull/30216#discussion_r515852857



##########
File path: R/pkg/R/functions_avro.R
##########
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' Avro processing functions for Column operations
+#'
+#' Avro processing functions defined for \code{Column}.
+#'
+#' @param x Column to compute on.
+#' @param jsonFormatSchema character Avro schema in JSON string format
+#' @param ... additional argument(s) passed as parser options.
+#' @name column_avro_functions
+#' @rdname column_avro_functions
+#' @family avro functions
+#' @note Avro is a built-in but external data source module since Spark 2.4.
+#'   Please deploy the application as per the deployment section of
+#'   the "Apache Avro Data Source Guide".
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(iris)
+#' schema <- paste(
+#'   c(
+#'     '{"type": "record", "namespace": "example.avro", "name": "Iris", "fields": [',
+#'     '{"type": ["double", "null"], "name": "Sepal_Length"},',
+#'     '{"type": ["double", "null"], "name": "Sepal_Width"},',
+#'     '{"type": ["double", "null"], "name": "Petal_Length"},',
+#'     '{"type": ["double", "null"], "name": "Petal_Width"},',
+#'     '{"type": ["string", "null"], "name": "Species"}]}'
+#'   ),
+#'   collapse="\\n"
+#' )
+#'
+#' df_serialized <- select(
+#'   df,
+#'   alias(to_avro(alias(struct(column("*")), "fields")), "payload")
+#' )
+#'
+#' df_deserialized <- select(
+#'   df_serialized,
+#'   from_avro(df_serialized$payload, schema)
+#' )
+#'
+#' head(df_deserialized)
+#' }
+NULL
+
+#' @include generics.R column.R
+NULL
+
+#' @details
+#' \code{from_avro} Converts a binary column of Avro format into its
+#' corresponding catalyst value.
+#' The specified schema must match the read data, otherwise the behavior is undefined:
+#' it may fail or return an arbitrary result.
+#' To deserialize the data with a compatible and evolved schema, the expected
+#' Avro schema can be set via the option \code{avroSchema}.

Review comment:
       Oh no. I think that's fine, but we'll have to add the R example to the link I pointed out.
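
(For reference, a minimal sketch of the kind of R snippet that guide entry could show, mirroring the roxygen example above. The spark-avro package coordinate and the mode = "PERMISSIVE" parser option below are illustrative assumptions, not part of this patch.)

    # Assumption: the external spark-avro module is added when starting the
    # session; adjust the coordinate to match your Spark/Scala version.
    library(SparkR)
    sparkR.session(sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.1")

    df <- createDataFrame(iris)

    # Avro schema describing the struct of all columns (same schema as above).
    schema <- paste(
      c(
        '{"type": "record", "namespace": "example.avro", "name": "Iris", "fields": [',
        '{"type": ["double", "null"], "name": "Sepal_Length"},',
        '{"type": ["double", "null"], "name": "Sepal_Width"},',
        '{"type": ["double", "null"], "name": "Petal_Length"},',
        '{"type": ["double", "null"], "name": "Petal_Width"},',
        '{"type": ["string", "null"], "name": "Species"}]}'
      ),
      collapse = "\n"
    )

    # Pack all columns into one struct, serialize it to an Avro binary column,
    # then deserialize it back using the schema.
    df_serialized <- select(
      df,
      alias(to_avro(alias(struct(column("*")), "fields")), "payload")
    )
    df_deserialized <- select(
      df_serialized,
      from_avro(df_serialized$payload, schema)
    )
    head(df_deserialized)

    # Parser options are passed through '...'; e.g. a permissive parse mode
    # (assuming the option names match the Scala from_avro API).
    head(select(df_serialized,
                from_avro(df_serialized$payload, schema, mode = "PERMISSIVE")))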

##########
File path: R/pkg/DESCRIPTION
##########
@@ -42,6 +42,7 @@ Collate:
     'context.R'
     'deserialize.R'
     'functions.R'
+    'functions_avro.R'

Review comment:
       Separating them sounds like an idea, yes. But for now I think we should just put it in the same place; otherwise, it would make more sense to place the ML ones separately as well, to follow the structure on the PySpark side.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
