GitHub user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12493#discussion_r60452743

    --- Diff: R/pkg/R/generics.R ---
    @@ -439,6 +439,10 @@ setGeneric("covar_samp", function(col1, col2) {standardGeneric("covar_samp") })
     #' @export
     setGeneric("covar_pop", function(col1, col2) {standardGeneric("covar_pop") })
    +#' @rdname dapply
    +#' @export
    +setGeneric("dapply", function(x, func, schema = NULL) { standardGeneric("dapply") })
    --- End diff --

    I think there are two main use cases we want to support:

    1. dapply and then collect - in this case not requiring a schema reduces the burden on the user, and we should be able to deserialize the results in the driver.
    2. dapply and then do more DataFrame operations - in this case, as @davies says, not having a schema makes it very limited.

    Thus my take from an API perspective would be to write two functions: `dapplyCollect`, which takes an optional schema, and `dapply`, which requires a schema. Internally we then handle either case.

    Regarding chaining `dapply` calls, we should make the optimizer merge them and call `R` only once (this will amortize the overhead of launching R etc., similar to PipelinedRDD).
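
    To make the proposed split concrete, here is a minimal sketch in plain base R. It does not use SparkR at all: a "DataFrame" is modeled as a list of local data.frame partitions, the schema is reduced to a vector of expected column names, and every helper name (`make_partitions`, `local_dapply`, `local_dapply_collect`, `fuse_funcs`) is hypothetical and chosen only for illustration. It shows a required schema on the DataFrame-returning path, an optional schema on the collect path, and chained functions composed so that each partition is visited once.

```r
# Toy stand-in for a distributed DataFrame: a list of data.frame partitions.
# All helper names below are hypothetical; none of them are SparkR functions.
make_partitions <- function(df, n) {
  unname(split(df, rep_len(seq_len(n), nrow(df))))
}

# Use case 2: the result stays partitioned for further operations, so the
# schema (here just the expected column names) must be given up front.
local_dapply <- function(parts, func, schema) {
  if (missing(schema) || is.null(schema)) {
    stop("schema is required when the result is used as a DataFrame")
  }
  lapply(parts, function(p) {
    out <- func(p)
    stopifnot(identical(names(out), schema))
    out
  })
}

# Use case 1: the result is collected right away, so the schema is optional
# and the column layout is taken from whatever func returns.
local_dapply_collect <- function(parts, func, schema = NULL) {
  out <- do.call(rbind, lapply(parts, func))
  if (!is.null(schema)) stopifnot(identical(names(out), schema))
  rownames(out) <- NULL
  out
}

# Chaining: compose the functions so each partition is processed in a single
# pass instead of one pass per dapply call.
fuse_funcs <- function(...) {
  funcs <- list(...)
  function(p) Reduce(function(acc, f) f(acc), funcs, p)
}

df <- data.frame(a = 1:4, b = c(10, 20, 30, 40))
parts <- make_partitions(df, 2)

add_c <- function(p) { p$c <- p$a + p$b; p }
keep_big <- function(p) p[p$c > 20, , drop = FALSE]

# Collect path: no schema needed.
collected <- local_dapply_collect(parts, add_c)

# DataFrame path: schema required, two steps fused into one pass.
chained <- local_dapply(parts, fuse_funcs(add_c, keep_big),
                        schema = c("a", "b", "c"))
```

    In this toy version fusing only saves a second `lapply` pass. In the real setting the saving would be the cost of launching R and serializing data once per chained call, which is the overhead the comment above wants to amortize.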