dgreiss commented on code in PR #36436:
URL: https://github.com/apache/arrow/pull/36436#discussion_r1284712932


##########
r/R/dataset-write.R:
##########
@@ -209,6 +228,202 @@ write_dataset <- function(dataset,
   )
 }
 
+#' Write a dataset into partitioned flat files.
+#'
+#' The `write_*_dataset()` functions are a family of wrappers around
+#' [write_dataset()] that allow for easy switching between functions for
+#' writing datasets.
+#'
+#' @param dataset [Dataset], [RecordBatch], [Table], `arrow_dplyr_query`, or
+#' `data.frame`. If an `arrow_dplyr_query`, the query will be evaluated and
+#' the result will be written. This means that you can `select()`, `filter()`,
+#' `mutate()`, etc. to transform the data before it is written if you need to.
+#' @param path String path, URI, or `SubTreeFileSystem` referencing a directory
+#' to write to (directory will be created if it does not exist)
+#' @param partitioning `Partitioning` or a character vector of columns to
+#' use as partition keys (to be written as path segments). Default is to
+#' use the current `group_by()` columns.
+#' @param basename_template String template for the names of files to be
+#' written. Must contain `"{i}"`, which will be replaced with an
+#' autoincremented integer to generate basenames of datafiles. For example,
+#' `"part-{i}.csv"` will yield `"part-0.csv", ...`. If not specified, it
+#' defaults to `"part-{i}.txt"` for `write_delim_dataset()`, `"part-{i}.csv"`
+#' for `write_csv_dataset()`, and `"part-{i}.tsv"` for `write_tsv_dataset()`.
+#' @param hive_style Write partition segments as Hive-style
+#' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is
+#' `TRUE`.
+#' @param existing_data_behavior The behavior to use when there is already
+#' data in the destination directory. Must be one of "overwrite", "error", or
+#' "delete_matching".
+#' - `overwrite` (the default): any new files created will overwrite
+#'   existing files
+#' - `error`: the operation will fail if the destination directory is not
+#'   empty
+#' - `delete_matching`: the writer will delete any existing partitions
+#'   if data is going to be written to those partitions, and will leave
+#'   alone partitions to which no data is written.
+#' @param max_partitions Maximum number of partitions any batch may be
+#' written into. Default is 1024L.
+#' @param max_open_files Maximum number of files that can be left open during
+#' a write operation. If greater than 0, this limits how many files can be
+#' left open at once; if an attempt is made to open too many files, the least
+#' recently used file will be closed. Setting this too low may fragment your
+#' data into many small files. The default is 900, which also leaves room for
+#' some files to be open by the scanner before hitting the default Linux
+#' limit of 1024.
+#' @param max_rows_per_file Maximum number of rows per file. If greater than
+#' 0, this limits how many rows are placed in any single file. Default is 0L.
+#' @param min_rows_per_group Write the row groups to disk when this number of
+#' rows has accumulated. Default is 0L.
+#' @param max_rows_per_group Maximum number of rows allowed in a single group.
+#' When this number is exceeded, the group is split and the next set of rows
+#' is written to the next group. This value must be greater than
+#' `min_rows_per_group`. Default is 1024 * 1024.
+#' @param col_names Whether to write an initial header line with column names.
+#' @param batch_size Maximum number of rows processed at a time. Default is
+#' 1024L.
+#' @param delim Delimiter used to separate values. Defaults to `","` for
+#' `write_delim_dataset()` and `write_csv_dataset()`, and `"\t"` for
+#' `write_tsv_dataset()`. Cannot be changed for `write_tsv_dataset()`.
+#' @param na String to be written in place of missing values. Quotes are not
+#' allowed in this string. The default is an empty string `""`.
+#' @param eol The end-of-line character to use for ending rows. The default
+#' is `"\n"`.
+#' @param quote How to handle fields which contain characters that need to be
+#' quoted.
+#' - `Needed` (the default): only enclose values in quotes which need them,
+#'   because their CSV rendering can contain quotes itself (e.g. strings or
+#'   binary values)
+#' - `AllValid`: enclose all valid values in quotes. Nulls are not quoted.
+#'   May cause readers to interpret all values as strings if the schema is
+#'   inferred.
+#' - `None`: do not enclose any values in quotes. Values must not contain
+#'   quotes ("), cell delimiters (,), or line endings (\\r, \\n), following
+#'   RFC4180. If values do contain these characters, an error is raised when
+#'   attempting to write.
+#' @return The input `dataset`, invisibly.
+#'
+#' @seealso [write_dataset()]
+#' @export
+write_delim_dataset <- function(dataset,
+                                path,
+                                partitioning = dplyr::group_vars(dataset),
+                                basename_template = "part-{i}.txt",
+                                hive_style = TRUE,
+                                existing_data_behavior = c("overwrite", "error", "delete_matching"),
+                                max_partitions = 1024L,
+                                max_open_files = 900L,
+                                max_rows_per_file = 0L,
+                                min_rows_per_group = 0L,
+                                max_rows_per_group = bitwShiftL(1, 20),
+                                col_names = TRUE,
+                                batch_size = 1024L,
+                                delim = ",",
+                                na = "",
+                                eol = "\n",
+                                quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }
+  write_dataset(
+    dataset = dataset,
+    path = path,
+    format = "txt",
+    partitioning = partitioning,
+    basename_template = basename_template,
+    hive_style = hive_style,
+    existing_data_behavior = existing_data_behavior,
+    max_partitions = max_partitions,
+    max_open_files = max_open_files,
+    max_rows_per_file = max_rows_per_file,
+    min_rows_per_group = min_rows_per_group,
+    max_rows_per_group = max_rows_per_group,
+    include_header = col_names,
+    batch_size = batch_size,
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
+  )
+}
+
+#' @rdname write_delim_dataset
+#' @export
+write_csv_dataset <- function(dataset,
+                              path,
+                              partitioning = dplyr::group_vars(dataset),
+                              basename_template = "part-{i}.csv",
+                              hive_style = TRUE,
+                              existing_data_behavior = c("overwrite", "error", "delete_matching"),
+                              max_partitions = 1024L,
+                              max_open_files = 900L,
+                              max_rows_per_file = 0L,
+                              min_rows_per_group = 0L,
+                              max_rows_per_group = bitwShiftL(1, 20),
+                              col_names = TRUE,
+                              batch_size = 1024L,
+                              delim = ",",
+                              na = "",
+                              eol = "\n",
+                              quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }
+  write_dataset(
+    dataset = dataset,
+    path = path,
+    format = "csv",
+    partitioning = partitioning,
+    basename_template = basename_template,
+    hive_style = hive_style,
+    existing_data_behavior = existing_data_behavior,
+    max_partitions = max_partitions,
+    max_open_files = max_open_files,
+    max_rows_per_file = max_rows_per_file,
+    min_rows_per_group = min_rows_per_group,
+    max_rows_per_group = max_rows_per_group,
+    include_header = col_names,
+    batch_size = batch_size,
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
+  )
+}
+
+#' @rdname write_delim_dataset
+#' @export
+write_tsv_dataset <- function(dataset,
+                              path,
+                              partitioning = dplyr::group_vars(dataset),
+                              basename_template = "part-{i}.tsv",
+                              hive_style = TRUE,
+                              existing_data_behavior = c("overwrite", "error", "delete_matching"),
+                              max_partitions = 1024L,
+                              max_open_files = 900L,
+                              max_rows_per_file = 0L,
+                              min_rows_per_group = 0L,
+                              max_rows_per_group = bitwShiftL(1, 20),
+                              col_names = TRUE,
+                              batch_size = 1024L,
+                              na = "",
+                              eol = "\n",
+                              quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }

Review Comment:
   Otherwise this throws an error: 
   
   ```
   write_delim_dataset(ds, dst_dir, max_rows_per_file = 5L)
   
   > Error: Invalid: max_rows_per_group must be less than or equal to max_rows_per_file
   ```
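   
   With the added check in place, the same call should succeed, because `max_rows_per_group` is silently lowered to match `max_rows_per_file`. A sketch of the effective behavior (not output from this PR):
   
   ```r
   # Under the added check, these two calls are equivalent:
   write_delim_dataset(ds, dst_dir, max_rows_per_file = 5L)
   write_delim_dataset(ds, dst_dir, max_rows_per_file = 5L, max_rows_per_group = 5L)
   ```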
   
   This check gets duplicated on all the `write_*_dataset()` functions, so there may be a way to refactor (see the sketch below), but I didn't think it was worth the indirection.
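   
   For reference, one shape that refactor could take is a small internal helper; the name `adjust_max_rows_per_group()` is hypothetical and not part of this PR. Since `missing()` has to be evaluated against each wrapper's own formals, the flags are computed at the call site:
   
   ```r
   # Hypothetical helper, sketched for illustration only: lower
   # max_rows_per_group to max_rows_per_file when only the latter was supplied.
   adjust_max_rows_per_group <- function(max_rows_per_file, max_rows_per_group,
                                         file_given, group_given) {
     if (file_given && !group_given && max_rows_per_group > max_rows_per_file) {
       max_rows_per_file
     } else {
       max_rows_per_group
     }
   }
   
   # Usage inside each write_*_dataset() wrapper:
   max_rows_per_group <- adjust_max_rows_per_group(
     max_rows_per_file, max_rows_per_group,
     file_given = !missing(max_rows_per_file),
     group_given = !missing(max_rows_per_group)
   )
   ```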


