westonpace commented on issue #35715:
URL: https://github.com/apache/arrow/issues/35715#issuecomment-1558318146
Looks like the problem might be in the R code getting ready to call the
dataset factory:
```r
DatasetFactory$create <- function(x,
                                  filesystem = NULL,
                                  format = c("parquet", "arrow", "ipc",
                                             "feather", "csv", "tsv", "text"),
                                  partitioning = NULL,
                                  hive_style = NA,
                                  factory_options = list(),
                                  ...) {
  if (is_list_of(x, "DatasetFactory")) {
    return(dataset___UnionDatasetFactory__Make(x))
  }
  if (is.character(format)) {
    format <- FileFormat$create(match.arg(format), ...)
  } else {
    assert_is(format, "FileFormat")
  }

  path_and_fs <- get_paths_and_filesystem(x, filesystem)
  info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)

  if (length(info) > 1 || info[[1]]$type == FileType$File) {
    # x looks like a vector of one or more file paths (not a directory path)
    return(FileSystemDatasetFactory$create(
      path_and_fs$fs,
      NULL,
      path_and_fs$path,
      format,
      factory_options = factory_options
    ))
  }

  partitioning <- handle_partitioning(partitioning, path_and_fs, hive_style)
  selector <- FileSelector$create(
    path_and_fs$path,
    allow_not_found = FALSE,
    recursive = TRUE
  )

  FileSystemDatasetFactory$create(path_and_fs$fs, selector, NULL, format,
                                  partitioning, factory_options)
}
```
If I understand correctly (which I very well might not), `info <-
path_and_fs$fs$GetFileInfo(path_and_fs$path)` will call `GetFileInfo` on every
single path, which triggers a separate S3 list request for every single path.
We should probably just assume that, if the length of `x` is greater than 1, we
are being given a list of files.
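A sketch of what that short-circuit might look like (this is only an illustration of the idea, not a tested patch; the guard and comments are mine):
```r
# Sketch: inside DatasetFactory$create, check the length of the path vector
# *before* calling GetFileInfo, so a multi-file input never triggers one
# S3 list request per path.
path_and_fs <- get_paths_and_filesystem(x, filesystem)

if (length(path_and_fs$path) > 1) {
  # Multiple paths supplied: assume they are files and build the factory
  # directly, skipping the per-path GetFileInfo round trips entirely.
  return(FileSystemDatasetFactory$create(
    path_and_fs$fs,
    NULL,
    path_and_fs$path,
    format,
    factory_options = factory_options
  ))
}

# Single path: one GetFileInfo call is cheap, so keep the existing
# file-vs-directory check for that case.
info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)
```
The trade-off is that a length-1 input could still be a single file or a directory, so the existing `GetFileInfo` check is kept only for that case.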
On the bright side, if I apply my fix from #35440, this call (`ds <-
open_dataset(s3)`) finishes in about 4.5 seconds.