thisisnic commented on code in PR #12799:
URL: https://github.com/apache/arrow/pull/12799#discussion_r843723852
##########
r/R/dataset-write.R:
##########
@@ -153,6 +153,16 @@ write_dataset <- function(dataset,
}
}
+ if (!missing(max_rows_per_file) && max_rows_per_group > max_rows_per_file) {
+ if (!missing(max_rows_per_group)) {
+ warning(paste0(c(
+ "'max_rows_per_group' must be less or equal to 'max_rows_per_file'.",
+ "\n'max_rows_per_group' set to value of 'max_rows_per_file'."
Review Comment:
```suggestion
"`max_rows_per_group` must be less or equal to `max_rows_per_file`.",
"\n`max_rows_per_group` set to value of `max_rows_per_file`."
```
Nice informative warning here. Typically, we use backticks (`) to refer
to parameter names in these kinds of messages.
How about, though, we adjust the condition under which the change on line
163 below is triggered, so that it applies only when `max_rows_per_file` has
been set (i.e. isn't missing) *and* `max_rows_per_group` hasn't been changed
from the default (i.e. is missing)?
I'm hesitant to make too many assumptions about user intentions, but I think
we can safely say that in those circumstances the user most likely wants to
set the maximum rows per file and just hasn't paid attention to the
`max_rows_per_group` parameter, so we can update the value silently, with no
warning.
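To make the suggested condition concrete, here is a rough sketch rather than the actual patch. The helper name `resolve_max_rows_per_group()` is hypothetical, the default values shown are assumptions about the `write_dataset()` signature, and treating `max_rows_per_file = 0` as "no limit" is likewise an assumption:

```r
# Hypothetical helper, not part of the arrow package: it only illustrates
# the proposed condition. The defaults below are assumed, not confirmed.
resolve_max_rows_per_group <- function(max_rows_per_file = 0L,
                                       max_rows_per_group = bitwShiftL(1L, 20L)) {
  # TRUE only if the caller passed max_rows_per_file and it is a real limit
  # (assumption: 0 means "no limit").
  file_limit_set <- !missing(max_rows_per_file) && max_rows_per_file > 0

  # TRUE only if the caller did not pass max_rows_per_group at all.
  group_left_at_default <- missing(max_rows_per_group)

  if (file_limit_set && group_left_at_default &&
      max_rows_per_group > max_rows_per_file) {
    # Silent adjustment: no warning, because the caller never chose a value.
    max_rows_per_group <- max_rows_per_file
  }

  max_rows_per_group
}

resolve_max_rows_per_group(max_rows_per_file = 5)
resolve_max_rows_per_group(max_rows_per_file = 5, max_rows_per_group = 10)
```

In this sketch the first call returns the file limit, while the second leaves the explicitly supplied value untouched, so any conflict between two user-supplied values is left for later validation to report.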
##########
r/tests/testthat/test-dataset-write.R:
##########
@@ -506,6 +506,47 @@ test_that("Max partitions fails with non-integer values and less than required p
)
})
+test_that("max_rows_per_group is adjusted if at odds with max_rows_per_file", {
+ skip_if_not_available("parquet")
+ df <- tibble::tibble(
+ int = 1:10,
+ dbl = as.numeric(1:10),
+ lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
+ chr = letters[1:10],
+ )
+ dst_dir <- make_temp_dir()
+
+ # max_rows_per_group unset => pass
+ expect_silent(
+ write_dataset(df, dst_dir, max_rows_per_file = 5)
+ )
+
+ expect_equal(
+ {
+ write_dataset(df, dst_dir, max_rows_per_file = 5)
+ list.files(dst_dir, "part-") %>%
+ length()
+ },
+ 2
+ )
Review Comment:
Great attention to detail, but I think we can remove this test, as it's a
little out of scope for this ticket. It's basically testing that
`max_rows_per_file` works as intended (which should be tested in the C++ layer
anyway), rather than the specific thing this ticket addresses (adjusting the
behaviour when the user specifies a `max_rows_per_file` that is at odds with
`max_rows_per_group`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.