I have a question about using UDFs in SparkR. I'm converting some R code to
SparkR.
• The original R code is:

      cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")],
                       MARGIN = 2, FUN = "%in%", c(61, 99))
• If I use dapply and pass the original apply call as the function for
  dapply:

      cols_in <- dapply(df, function(x) {
        apply(x[, paste("cr_cd", 1:12, sep = "")], Margin = 2,
              function(y) { y %in% c(61, 99) })
      }, schema)

  it fails with:

      Error in match.fun(FUN) : argument "FUN" is missing, with no default

  (a fuller attempt at this is sketched after the list)
• If I use spark.lapply, it also fails; it seems that in Spark the column
  cr_cd1 is ambiguous:

      cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")],
                              function(x) { x %in% c(61, 99) })

      16/09/08 ERROR RBackendHandler: select on 3101 failed
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is
        ambiguous, could be: cr_cd1#2169L, cr_cd1#17787L.;
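
The closest I have come to a self-contained dapply version is the sketch
below; the structType/structField schema and the "boolean" field types are my
own guesses, and I have not tested it at scale. One thing I noticed while
writing it up: apply's second argument is MARGIN in upper case, and R's
argument matching is case-sensitive, so in my snippet above Margin = 2 falls
into ..., the anonymous function is matched positionally to MARGIN, and FUN
ends up unset, which is exactly the match.fun(FUN) error quoted there.

    library(SparkR)

    cr_cols <- paste("cr_cd", 1:12, sep = "")

    # dapply needs an explicit output schema; logical results should map
    # to "boolean" fields (the field names and types are my assumptions)
    schema <- do.call(structType,
                      lapply(cr_cols, function(nm) structField(nm, "boolean")))

    cols_in <- dapply(df, function(part) {
      # apply() returns a matrix, so wrap it back into a data.frame before
      # returning; note MARGIN in upper case
      as.data.frame(apply(part[, cr_cols], MARGIN = 2,
                          FUN = "%in%", c(61, 99)))
    }, schema)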

In the meantime, if I use dapplyCollect, it works, but it will lead to memory
problems when the data is big:

    wrapper <- function(df) {
      out <- apply(df[, paste("cr_cd", 1:12, sep = "")],
                   MARGIN = 2, FUN = "%in%", c(61, 99))
      return(out)
    }
    cols_in <- dapplyCollect(df, wrapper)

How can I make dapply work in my case?
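
For completeness, since what I ultimately need is one boolean column per
cr_cd column, I also sketched a UDF-free version with column expressions.
SparkR overloads %in% for Column objects, so if I understand the column API
correctly this should stay distributed and never pull data to the driver
(the alias names are just my choice):

    library(SparkR)

    cr_cols <- paste("cr_cd", 1:12, sep = "")

    # build one boolean Column per cr_cd column with the Column %in%
    # operator, keeping the original column names via alias()
    flag_cols <- lapply(cr_cols,
                        function(nm) alias(df[[nm]] %in% c(61, 99), nm))

    cols_in <- do.call(select, c(list(df), flag_cols))

Would something like this be preferable to a dapply UDF here, or is there a
way to make dapply itself work?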
