nealrichardson commented on a change in pull request #9800:
URL: https://github.com/apache/arrow/pull/9800#discussion_r607385332



##########
File path: r/R/dataset.R
##########
@@ -64,6 +73,8 @@ open_dataset <- function(sources,
                          partitioning = hive_partition(),
                          unify_schemas = NULL,
                          ...) {
+  # TODO: default to unify_schemas = TRUE when `sources` is a vector of file 
paths/URIs?

Review comment:
       Why would we have a different default for that case?

##########
File path: r/R/dataset.R
##########
@@ -23,17 +23,23 @@
 #' `open_dataset()` to point to a directory of data files and return a
 #' `Dataset`, then use `dplyr` methods to query it.
 #'
-#' @param sources Either:
-#'   * a string path to a directory containing data files
+#' @param sources One of:
+#'   * a string path or URI to a directory containing data files
+#'   * a vector of one or more string paths or URIs to data files

Review comment:
       While "vector" is accurate in the R sense of length one, I think it's 
good to make explicit that this could be a single file, even if it is 
technically redundant.
   
   ```suggestion
   #'   * a string path or URI to a single file
   #'   * a character vector of paths or URIs to individual data files
   ```

##########
File path: r/R/dataset-factory.R
##########
@@ -45,15 +45,20 @@ DatasetFactory$create <- function(x,
     return(dataset___UnionDatasetFactory__Make(x))
   }
 
-  path_and_fs <- get_path_and_filesystem(x, filesystem)
-  selector <- FileSelector$create(path_and_fs$path, allow_not_found = FALSE, 
recursive = TRUE)
-
   if (is.character(format)) {
     format <- FileFormat$create(match.arg(format), ...)
   } else {
     assert_is(format, "FileFormat")
   }
 
+  path_and_fs <- get_paths_and_filesystem(x, filesystem)
+  info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)
+
+  if (length(info) > 1 || info[[1]]$type == FileType$File) {
+    # x looks like a vector of one or more file paths (not a directory path)
+    return(FileSystemDatasetFactory$create(path_and_fs$fs, NULL, 
path_and_fs$path, format))
+  }

Review comment:
       Probably YAGNI

##########
File path: r/R/filesystem.R
##########
@@ -273,33 +273,56 @@ FileSystem$from_uri <- function(uri) {
   fs___FileSystemFromUri(uri)
 }
 
-get_path_and_filesystem <- function(x, filesystem = NULL) {
+get_paths_and_filesystem <- function(x, filesystem = NULL) {
   # Wrapper around FileSystem$from_uri that handles local paths
   # and an optional explicit filesystem
   if (inherits(x, "SubTreeFileSystem")) {
     return(list(fs = x$base_fs, path = x$base_path))
   }
-  assert_that(is.string(x))
-  if (is_url(x)) {
+  assert_that(is.character(x))
+  are_urls <- are_urls(x)
+  if (any(are_urls)) {
+    if (!all(are_urls)) {
+      stop(
+        "Vectors of paths and URIs for different file systems are not 
supported",
+        call. = FALSE
+      )
+    }
     if (!is.null(filesystem)) {
       # Stop? Can't have URL (which yields a fs) and another fs
     }
-    FileSystem$from_uri(x)
+    # TODO: do this more efficiently?
+    x <- lapply(x, FileSystem$from_uri)

Review comment:
       I'm not sure if it matters; in practice I'm most keen to support the 
single-file use case, in which this is moot

##########
File path: r/R/filesystem.R
##########
@@ -273,33 +273,53 @@ FileSystem$from_uri <- function(uri) {
   fs___FileSystemFromUri(uri)
 }
 
-get_path_and_filesystem <- function(x, filesystem = NULL) {
+get_paths_and_filesystem <- function(x, filesystem = NULL) {

Review comment:
       What if you left `get_path_and_filesystem()` (singular) as it was and 
just lapply over the vector of paths where you need multiple?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to