[GitHub] [arrow] westonpace commented on issue #35715: open_dataset() on long vec of URIs uses much more RAM & is much slower than on partition root.

via GitHub Mon, 22 May 2023 18:19:30 -0700


westonpace commented on issue #35715:
URL: https://github.com/apache/arrow/issues/35715#issuecomment-1558318146


   The problem appears to be in the R code that prepares the call to the
dataset factory:
   
   ```
   DatasetFactory$create <- function(x,
                                     filesystem = NULL,
                                     format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
                                     partitioning = NULL,
                                     hive_style = NA,
                                     factory_options = list(),
                                     ...) {
     if (is_list_of(x, "DatasetFactory")) {
       return(dataset___UnionDatasetFactory__Make(x))
     }
   
     if (is.character(format)) {
       format <- FileFormat$create(match.arg(format), ...)
     } else {
       assert_is(format, "FileFormat")
     }
   
     path_and_fs <- get_paths_and_filesystem(x, filesystem)
     info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)
   
     if (length(info) > 1 || info[[1]]$type == FileType$File) {
       # x looks like a vector of one or more file paths (not a directory path)
       return(FileSystemDatasetFactory$create(
         path_and_fs$fs,
         NULL,
         path_and_fs$path,
         format,
         factory_options = factory_options
       ))
     }
   
     partitioning <- handle_partitioning(partitioning, path_and_fs, hive_style)
     selector <- FileSelector$create(
       path_and_fs$path,
       allow_not_found = FALSE,
       recursive = TRUE
     )
   
     FileSystemDatasetFactory$create(
       path_and_fs$fs, selector, NULL, format, partitioning, factory_options
     )
   }
   ```
   
   If I understand correctly (which I very well might not),
`info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)` calls `GetFileInfo` on
every single path, and each of those lookups triggers its own S3 ls call.
We should probably just assume that, if the length of `x` is greater than 1,
we are being given a list of files.
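
A minimal sketch of that shortcut, runnable without S3 or the arrow package. The helper `is_file_list()` and the stub `fake_get_file_info()` are hypothetical names invented for illustration; the stub only stands in for `path_and_fs$fs$GetFileInfo()` and counts how many per-path lookups would have been issued. It assumes a plain string `type` field rather than the real `FileType` enum.

```r
# Hypothetical helper (not part of the arrow package): decide whether `paths`
# should be treated as an explicit list of files.
# `get_file_info` stands in for path_and_fs$fs$GetFileInfo and is assumed to
# return a list with one entry per path, each carrying a `type` field.
is_file_list <- function(paths, get_file_info, file_type = "File") {
  if (length(paths) > 1) {
    # More than one path: assume files and skip the per-path lookup entirely.
    return(TRUE)
  }
  # A single path may be a file or a directory, so one lookup is still needed.
  info <- get_file_info(paths)
  identical(info[[1]]$type, file_type)
}

# Stub lookup that counts one "remote call" per path it is asked about.
lookups <- 0
fake_get_file_info <- function(paths) {
  lookups <<- lookups + length(paths)
  lapply(paths, function(p) {
    list(type = if (grepl("\\.parquet$", p)) "File" else "Directory")
  })
}

# A long vector of URIs is classified without any lookups.
uris <- sprintf("bucket/data/part-%d.parquet", 1:1000)
many <- is_file_list(uris, fake_get_file_info)
lookups_after_many <- lookups

# A single partition root still costs exactly one lookup.
single <- is_file_list("bucket/data", fake_get_file_info)
```

The trade-off in this sketch is that a vector mixing files and directories would no longer be detected up front; any such error would surface later, when the dataset factory inspects the files.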
   
   On the bright side, if I use my fix in #35440, then this call
(`ds <- open_dataset(s3)`) finishes in about 4.5 seconds.


