Shivaram Venkataraman created SPARK-6822: --------------------------------------------
Summary: lapplyPartition passes empty list to function
Key: SPARK-6822
URL: https://issues.apache.org/jira/browse/SPARK-6822
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman

I have an RDD containing two elements, as expected and as shown by a collect. When I call lapplyPartition on it with a function that prints its arguments to stderr, the function is called three times: the first two calls receive the expected arguments, but the third receives an empty list. I was wondering whether that's a bug or whether there are conditions under which that's possible. I apologize that I don't have a simple test case ready yet; I ran into this potential bug while developing a separate package, plyrmr. If you are willing to install it, the test case is very simple. The RDD that triggers the problem is the result of a join, but I couldn't replicate the problem with a plain vanilla join.

Example from Antonio on the SparkR JIRA:

I don't have time to try any harder to repro this without plyrmr. For the record, this is the example:

{code}
library(plyrmr)
plyrmr.options(backend = "spark")
df1 = mtcars[1:4,]
df2 = mtcars[3:6,]
w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
{code}

gapply is implemented with lapplyPartition in most cases, merge with a join, and as.data.frame with a collect. The join has an arbitrary argument of 4 partitions; if I turn that down to 2L, the problem disappears. I will check in a version with a workaround in place, but a debugging statement will leave a record in stderr whenever the workaround kicks in, so that we can track it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
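
For reference, a minimal sketch of the kind of workaround described above: wrapping the user function so that empty partitions are skipped, while leaving a record in stderr each time the guard kicks in. This is a hypothetical illustration, not the actual plyrmr fix; the wrapper name `skip_empty` is invented here.

{code}
# Hedged sketch: guard a function passed to lapplyPartition against
# being called with an empty partition. Names here are hypothetical.
skip_empty <- function(f) {
  function(part) {
    if (length(part) == 0) {
      # Workaround triggered: log to stderr so occurrences can be tracked
      write("lapplyPartition workaround: skipped empty partition", stderr())
      return(list())
    }
    f(part)
  }
}

# usage (assuming rdd and my_fun exist):
# lapplyPartition(rdd, skip_empty(my_fun))
{code}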