You might need MARGIN capitalized. R argument names are case-sensitive, so with `Margin = 2` the name does not match `MARGIN`; your anonymous function is then matched positionally to MARGIN, and FUN is left missing, which is exactly the match.fun error you saw. This example works:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c, function(x) {
  apply(x[, paste("c", 1:2, sep = "")], MARGIN = 2,
        FUN = function(y) { y %in% c(61, 99) })
})
# dapplyCollect does not require the schema parameter
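If you want dapply itself (to keep the result distributed instead of collecting it), the schema argument has to describe the rows your function returns. A minimal sketch of the same cars example, assuming a running Spark session and SparkR 2.x; note that dapply's function must return an R data.frame, while apply returns a matrix, hence the as.data.frame wrapper:

```r
library(SparkR)
sparkR.session()

# rename the columns to c1, c2, as above
sdf <- selectExpr(as.DataFrame(cars), "speed as c1", "dist as c2")

# dapply (unlike dapplyCollect) needs a schema for the output rows:
# two boolean columns, one per input column tested
schema <- structType(structField("c1", "boolean"),
                     structField("c2", "boolean"))

out <- dapply(sdf, function(x) {
  # apply returns a logical matrix; dapply requires a data.frame
  as.data.frame(apply(x[, c("c1", "c2")], MARGIN = 2,
                      FUN = function(y) { y %in% c(61, 99) }))
}, schema)

head(collect(out))
```

The same pattern should carry over to your cr_cd1..cr_cd12 columns with twelve boolean structFields in the schema.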
_____________________________
From: xingye <tracy.up...@gmail.com>
Sent: Friday, September 9, 2016 10:35 AM
Subject: questions about using dapply
To: user@spark.apache.org

I have a question about using a UDF in SparkR. I'm converting some R code into SparkR.

* The original R code is:

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")],
                 MARGIN = 2, FUN = "%in%", c(61, 99))

* If I use dapply and put the original apply call inside the function for dapply:

cols_in <- dapply(df, function(x) {
  apply(x[, paste("cr_cd", 1:12, sep = "")], Margin = 2,
        function(y) { y %in% c(61, 99) })
}, schema)

the error shows:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

* If I use spark.lapply, it still shows an error. It seems that in Spark the column cr_cd1 is ambiguous.

cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")],
                        function(x) { x %in% c(61, 99) })

16/09/08 ERROR RBackendHandler: select on 3101 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous,
  could be: cr_cd1#2169L, cr_cd1#17787L.;

* If I use dapplyCollect, it works, but it will lead to memory issues if the data is big:

wrapper <- function(df) {
  out <- apply(df[, paste("cr_cd", 1:12, sep = "")],
               MARGIN = 2, FUN = "%in%", c(61, 99))
  return(out)
}
cols_in <- dapplyCollect(df, wrapper)

How can dapply work in my case?
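The match.fun error in the dapply attempt above comes from plain R argument matching, not from Spark: names are case-sensitive, so `Margin = 2` falls into `...`, the anonymous function is consumed positionally as MARGIN, and FUN is left unset. A Spark-free illustration with two made-up cr_cd columns:

```r
# small local stand-in for the cr_cd columns
df_local <- data.frame(cr_cd1 = c(4, 61, 7), cr_cd2 = c(2, 10, 99))

# correct: MARGIN capitalized, FUN applied column-wise
ok <- apply(df_local, MARGIN = 2, FUN = function(y) { y %in% c(61, 99) })
ok
#      cr_cd1 cr_cd2
# [1,]  FALSE  FALSE
# [2,]   TRUE  FALSE
# [3,]  FALSE   TRUE

# broken: "Margin" is not "MARGIN", so the anonymous function is taken
# as the MARGIN argument and FUN is missing
# apply(df_local, Margin = 2, function(y) { y %in% c(61, 99) })
# Error in match.fun(FUN) : argument "FUN" is missing, with no default
```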