[GitHub] [arrow] jonkeane commented on a change in pull request #9748: ARROW-11729: [R] Add examples to datasets documentation

GitBox Thu, 26 Aug 2021 13:14:38 -0700


jonkeane commented on a change in pull request #9748:
URL: https://github.com/apache/arrow/pull/9748#discussion_r696941059




##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples

Review comment:
       Now that we are using it, we should upgrade this to `@examplesIf`, as in other places:
       https://github.com/apache/arrow/blob/master/r/R/dataset.R#L84
   
   ```suggestion
   #' @examplesIf arrow_with_dataset() & arrow_with_parquet() & requireNamespace("dplyr", quietly = TRUE)
   ```
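
For readers unfamiliar with the tag, here is a minimal sketch of how `@examplesIf` sits in a roxygen block. The `demo_write()` helper and its documentation are hypothetical and only illustrate placement; the condition is an assumption modelled on the suggestion above, not the final wording of the PR.

```r
#' Write mtcars as a dataset partitioned by cyl (hypothetical helper)
#'
#' @param path Directory to write the dataset into.
#' @return `path`, invisibly.
#' @examplesIf arrow::arrow_with_dataset() && requireNamespace("dplyr", quietly = TRUE)
#' # These example lines are only meant to run when the condition above is TRUE.
#' out <- tempfile()
#' demo_write(out)
#' list.files(out, recursive = TRUE)
demo_write <- function(path) {
  # Illustration only: delegate to arrow's write_dataset().
  arrow::write_dataset(datasets::mtcars, path, partitioning = "cyl")
  invisible(path)
}
```

With the condition on the tag, the example body itself needs no `if()` wrapper.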

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()
+#' write_dataset(mtcars, one_level_tree, partitioning = "cyl")
+#' list.files(one_level_tree, recursive = TRUE)
+#'
+#' # You can also partition by the values in multiple columns
+#' # (here: "cyl" and "gear").
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' two_levels_tree <- tempdir()
+#' write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
+#' list.files(two_levels_tree, recursive = TRUE)
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#'
+#' # You can obtain the same result as as the previous examples by combining
+#' # both arrow and dplyr.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'
+#'  # Write a structure cyl=X/gear=Y/part-Z.parquet.
+#'  two_levels_tree_2 <- tempfile()
+#'  d %>% write_dataset(two_levels_tree_2)
+#'  list.files(two_levels_tree_2, recursive = TRUE)

Review comment:
       Should we add a note here that this listing will be the same as that of
       `two_levels_tree` above (apart from the base temp directory)?

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()
+#' write_dataset(mtcars, one_level_tree, partitioning = "cyl")
+#' list.files(one_level_tree, recursive = TRUE)
+#'
+#' # You can also partition by the values in multiple columns
+#' # (here: "cyl" and "gear").
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' two_levels_tree <- tempdir()
+#' write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
+#' list.files(two_levels_tree, recursive = TRUE)
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#'
+#' # You can obtain the same result as as the previous examples by combining
+#' # both arrow and dplyr.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {

Review comment:
       We should move this check up into the `@examplesIf` condition. Most people
       will have dplyr installed, so very few users would be able to run the
       examples before this point but not the ones after it.

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()

Review comment:
       ```suggestion
   #' one_level_tree <- tempdir()
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()

Review comment:
       Also, should this be `tempfile()`, as you use below, so that it's a new,
       unique path every time?
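
A small base-R sketch of the difference, for reference. The variable names are made up for illustration and nothing here depends on arrow.

```r
# tempdir() returns the per-session temporary directory, so repeated calls
# give the same path and anything written there shares one directory.
shared_a <- tempdir()
shared_b <- tempdir()
print(identical(shared_a, shared_b))

# tempfile() builds a fresh path on each call; it does not create
# a file or directory on disk.
unique_a <- tempfile()
unique_b <- tempfile()
print(identical(unique_a, unique_b))
print(file.exists(unique_a))

# Creating the directory explicitly gives an example its own empty location.
dir.create(unique_a)
print(length(list.files(unique_a)))
```

Because both `tempdir()` calls point at the same place, a recursive `list.files()` on it would also pick up files written by earlier examples in the same session.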

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()
+#' write_dataset(mtcars, one_level_tree, partitioning = "cyl")
+#' list.files(one_level_tree, recursive = TRUE)
+#'
+#' # You can also partition by the values in multiple columns
+#' # (here: "cyl" and "gear").
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' two_levels_tree <- tempdir()
+#' write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
+#' list.files(two_levels_tree, recursive = TRUE)
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#'
+#' # You can obtain the same result as as the previous examples by combining
+#' # both arrow and dplyr.

Review comment:
       I would change "by combining both arrow and dplyr" to
       "using arrow with a dplyr pipeline:"

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()
+#' write_dataset(mtcars, one_level_tree, partitioning = "cyl")
+#' list.files(one_level_tree, recursive = TRUE)
+#'
+#' # You can also partition by the values in multiple columns
+#' # (here: "cyl" and "gear").
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' two_levels_tree <- tempdir()
+#' write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
+#' list.files(two_levels_tree, recursive = TRUE)
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#'
+#' # You can obtain the same result as as the previous examples by combining
+#' # both arrow and dplyr.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'
+#'  # Write a structure cyl=X/gear=Y/part-Z.parquet.
+#'  two_levels_tree_2 <- tempfile()
+#'  d %>% write_dataset(two_levels_tree_2)
+#'  list.files(two_levels_tree_2, recursive = TRUE)
+#' }
+#'
+#' # And you can also turn off the Hive-style directory naming where the column
+#' # name is included with the values by using `hive_style = FALSE`.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {

Review comment:
       Again, remove this `if()` wrapping

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,46 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # You can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' one_level_tree<- tempdir()
+#' write_dataset(mtcars, one_level_tree, partitioning = "cyl")
+#' list.files(one_level_tree, recursive = TRUE)
+#'
+#' # You can also partition by the values in multiple columns
+#' # (here: "cyl" and "gear").
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' two_levels_tree <- tempdir()
+#' write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
+#' list.files(two_levels_tree, recursive = TRUE)
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#'
+#' # You can obtain the same result as as the previous examples by combining
+#' # both arrow and dplyr.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'
+#'  # Write a structure cyl=X/gear=Y/part-Z.parquet.
+#'  two_levels_tree_2 <- tempfile()
+#'  d %>% write_dataset(two_levels_tree_2)
+#'  list.files(two_levels_tree_2, recursive = TRUE)
+#' }
+#'
+#' # And you can also turn off the Hive-style directory naming where the column
+#' # name is included with the values by using `hive_style = FALSE`.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  # Write a structure X/Y/part-Z.parquet.
+#'  two_levels_tree_3 <- tempfile()

Review comment:
       This is minor, but maybe we should name this variable something like
       `two_levels_tree_no_hive`, so that it's clear we're not expecting the same
       listing as we saw above, whether from `partitioning = c("cyl", "gear")`
       or from `group_by(...) %>% write_dataset(...)`.
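
To make the contrast concrete, a rough sketch that writes the same data with and without Hive-style naming and prints the top-level directory names. It assumes arrow is installed with dataset and Parquet support; the `hive_dir` and `no_hive_dir` names are illustrative only, and the guard is there so the snippet can be pasted into any session.

```r
# Assumption: arrow is available with dataset and Parquet support.
if (requireNamespace("arrow", quietly = TRUE) &&
    arrow::arrow_with_dataset() && arrow::arrow_with_parquet()) {
  # Default Hive-style naming: directory names include the column name.
  hive_dir <- tempfile()
  arrow::write_dataset(mtcars, hive_dir, partitioning = c("cyl", "gear"))

  # hive_style = FALSE: directory names are the bare partition values.
  no_hive_dir <- tempfile()
  arrow::write_dataset(
    mtcars, no_hive_dir,
    partitioning = c("cyl", "gear"),
    hive_style = FALSE
  )

  # Compare the first directory level of each tree.
  hive_names <- list.files(hive_dir)
  no_hive_names <- list.files(no_hive_dir)
  print(hive_names)
  print(no_hive_names)
}
```

The first listing follows the `cyl=X` form and the second the bare `X` form described in the example comments, which is why a distinct variable name helps.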




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

