Hello all, I have just started testing SparkR 2.0 and found the execution of dapply to be very slow.
For example, in plain R the following code

    set.seed(2)
    random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
    system.time(dummy_res <- random_DF[random_DF[,1] > 1, ])
       user  system elapsed
      0.005   0.000   0.006

executes in about 6 ms. Now, if I create a Spark DataFrame with 4 partitions and run the same filter on 4 cores, I get:

    sparkR.session(master = "local[4]")
    random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
    subset_DF_Spark <- dapply(
      random_DF_Spark,
      function(x) {
        y <- x[x[1] > 1, ]
        y
      },
      schema(random_DF_Spark))
    system.time(dummy_res_Spark <- collect(subset_DF_Spark))
       user  system elapsed
      2.003   0.119  62.919

That is about a minute, which seems abnormally slow. Am I missing something?

I also get this warning:

    16/07/31 15:07:02 WARN TaskSetManager: Stage 64 contains a task of very
    large size (16411 KB). The maximum recommended task size is 100 KB.

Why is this recommended 100 KB limit so low?

I am using R 3.3.0 on Mac OS 10.10.5.

Any insight welcome (a follow-up experiment I plan to try is sketched in the PS below).

Best,
Yann-Aël

-- 
=========================================
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
=========================================
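PS: To try to isolate where the time goes, here is the follow-up experiment I plan to run. The idea is to cache and materialize the SparkDataFrame first (using cache() and count() from the SparkR API), so that the one-time cost of copying the local data.frame over to Spark is paid before the timed section. This is only a sketch of what I intend to try; I am assuming (but have not verified) that materializing the cached data up front actually takes that copy out of the measured time.

    # Copy the local data to Spark and materialize it before timing anything
    random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
    cache(random_DF_Spark)
    count(random_DF_Spark)  # action that forces the cached partitions to be computed

    # Time only the dapply + collect round trip
    system.time({
      subset_DF_Spark <- dapply(
        random_DF_Spark,
        function(x) {
          y <- x[x[1] > 1, ]
          y
        },
        schema(random_DF_Spark))
      dummy_res_Spark <- collect(subset_DF_Spark)
    })

If this still takes close to a minute with the input cached, I would read that as the R-to-JVM (de)serialization inside dapply and collect dominating the cost, rather than the shipping of the input data itself.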