[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

GitBox Fri, 08 Jan 2021 15:15:55 -0800


nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301




##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       If I read the code correctly, we're not distinguishing between arguments 
that are valid CsvReadOptions/CsvConvertOptions and just not supported in 
datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea` 
options. Is that distinction worth making? I.e. is erroring with `Unsupported 
option: na` clear enough that we're not going to get another bug report about 
this?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Why does this need to be asserted? Because the validating code assumes 
it?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       You don't have to use it here, but for reference we have an 
`oxford_paste` utility function for making human-readable, appropriately 
comma'd error message strings: 
https://github.com/apache/arrow/blob/master/r/R/util.R#L18

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Could you add a comment here explaining how this catches ambiguous args? 
I don't use `pmatch` enough for this to be immediately apparent. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Reply via email to