Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16512#discussion_r95514615

    --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
    @@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
       expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
       expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
    +  expect_equal(getNumPartitions(toRDD(df)), 1)
    --- End diff --

    Hmm, good point. The behavior is a bit strange, and I haven't thought of a concise way to document it.

    https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L131

    Basically, the partition count is the larger of `numSlices` and the ceiling of the data size divided by `spark.r.maxAllocationLimit`, *but* it is then capped by the length of the data. That length is wrong when the data is a data.frame, since `length()` of a data.frame is its number of columns, not its number of rows. Is this unintentional behavior (i.e. always capped by the number of columns, even when the data size is larger than `spark.r.maxAllocationLimit`)? I can't tell...
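To make the quirk concrete, here is a minimal Python sketch of the computation described above (this is an illustration of the logic in `R/pkg/R/context.R`, not the actual SparkR source; the function name and parameters are hypothetical, and the allocation limit is passed explicitly rather than assuming a default):

```python
import math

def num_partitions(num_slices, object_size_bytes, data_length, max_allocation):
    # Sketch of the partition-count logic discussed in the comment:
    # take the larger of numSlices and ceiling(objectSize / spark.r.maxAllocationLimit),
    # then cap the result at length(data).
    parts = max(num_slices, math.ceil(object_size_bytes / max_allocation))
    # In R, length() of a data.frame is its COLUMN count, so for a wide
    # object this cap can shrink the partition count unexpectedly.
    return min(parts, data_length)

# A 500 MB, 2-column data.frame with a 200 MB limit would want 3 partitions,
# but length(data) == 2 caps it at 2:
print(num_partitions(num_slices=1,
                     object_size_bytes=500 * 1024**2,
                     data_length=2,
                     max_allocation=200 * 1024**2))  # -> 2
```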