Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12493#discussion_r60452743
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -439,6 +439,10 @@ setGeneric("covar_samp", function(col1, col2) {standardGeneric("covar_samp") })
     #' @export
     setGeneric("covar_pop", function(col1, col2) {standardGeneric("covar_pop") })
     
    +#' @rdname dapply
    +#' @export
    +setGeneric("dapply", function(x, func, schema = NULL) { standardGeneric("dapply") })
    --- End diff --
    
    I think there are two main use cases we want to support:
    1. `dapply` and then collect - in this case not requiring a schema reduces user burden, and we should be able to deserialize the results in the driver.
    2. `dapply` and then do more DataFrame operations - in this case, as @davies says, not having a schema makes it very limited.
    
    Thus my take from an API perspective would be to write two functions: `dapplyCollect`, which has an optional schema, and `dapply`, which has a required schema. Internally we then handle either case.
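
To make the proposed split concrete, here is a minimal language-neutral sketch in plain Python (not SparkR). Everything in it is an assumption for illustration: `dapply` and `dapply_collect` are hypothetical stand-ins, partitions are modeled as plain lists of tuples, and no Spark API is used.

```python
def dapply(partitions, func, schema):
    """Hypothetical stand-in: schema is required so the result can feed further operations."""
    if schema is None:
        raise ValueError("dapply requires a schema")
    # Keep the schema alongside the transformed partitions.
    return {"schema": list(schema), "partitions": [func(p) for p in partitions]}


def dapply_collect(partitions, func, schema=None):
    """Hypothetical stand-in: apply func and return all rows to the caller; schema is optional."""
    rows = [row for p in partitions for row in func(p)]
    if schema is not None:
        # Only validate row width when the caller chose to supply a schema.
        for row in rows:
            if len(row) != len(schema):
                raise ValueError("row does not match schema")
    return rows
```

The point of the split is that only the variant whose output stays distributed needs to know the result's shape up front; the collecting variant can inspect the rows once they arrive.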
     
    Regarding chaining `dapply` calls, we should make the optimizer merge these and call R only once (this will amortize the overhead of launching R etc., similar to PipelinedRDD).
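
The merging idea can be sketched as plain function composition. This is an illustrative Python sketch, not the optimizer's actual implementation: `pipeline` and `CountingWorker` are hypothetical names, and `CountingWorker.run` merely stands in for the cost of launching one R worker.

```python
from functools import reduce


def pipeline(funcs):
    """Compose partition functions left to right into a single function."""
    def merged(partition):
        return reduce(lambda data, f: f(data), funcs, partition)
    return merged


class CountingWorker:
    """Hypothetical stand-in for an external worker; counts how often it is launched."""

    def __init__(self):
        self.launches = 0

    def run(self, func, partition):
        self.launches += 1  # each call models one worker launch
        return func(partition)
```

Running two chained functions separately costs two launches, while running `pipeline([f, g])` costs one and yields the same result, which is the amortization the comment refers to.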

