[ 
https://issues.apache.org/jira/browse/ARROW-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502261#comment-17502261
 ] 

Dewey Dunnington commented on ARROW-15856:
------------------------------------------

Thanks for reporting this! Without a reproducible example it's difficult to 
know exactly what's happening. I've prepared one below that works for me but 
might not reflect your exact case. Could you modify the reprex below to see 
if you can replicate the failure?

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# create a directory
dir <- tempfile()
dir.create(dir)
bucket_local <- file.path(dir, "bucket_name")
dir.create(bucket_local)

# start minio in the background (you can do this from the terminal too)
minio_server <- processx::process$new(
  "minio",
  args = c("server", dir),
  supervise = TRUE
)
Sys.sleep(1)
stopifnot(minio_server$is_alive())

# make sure we can connect
s3_uri <-
  "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
bucket <- s3_bucket(s3_uri)

# write a dataset
data <- expand.grid(
  letter = letters[1:5],
  number = 1:10
)

data %>% 
  group_by(letter) %>% 
  write_dataset(
    path = file.path(bucket_local, "dataset-test"),
    format = "csv"
  )

bucket$ls("bucket_name/dataset-test", recursive = TRUE)
#>  [1] "bucket_name/dataset-test/letter=a"           
#>  [2] "bucket_name/dataset-test/letter=a/part-0.csv"
#>  [3] "bucket_name/dataset-test/letter=b"           
#>  [4] "bucket_name/dataset-test/letter=b/part-0.csv"
#>  [5] "bucket_name/dataset-test/letter=c"           
#>  [6] "bucket_name/dataset-test/letter=c/part-0.csv"
#>  [7] "bucket_name/dataset-test/letter=d"           
#>  [8] "bucket_name/dataset-test/letter=d/part-0.csv"
#>  [9] "bucket_name/dataset-test/letter=e"           
#> [10] "bucket_name/dataset-test/letter=e/part-0.csv"

# make sure open_dataset() works on the local directory
open_dataset(file.path(bucket_local, "dataset-test"), format = "csv") %>% 
  collect() %>% 
  head()
#> # A tibble: 6 × 2
#>   number letter
#>    <int> <chr> 
#> 1      1 a     
#> 2      2 a     
#> 3      3 a     
#> 4      4 a     
#> 5      5 a     
#> 6      6 a

# make sure open_dataset() works with the remote filesystem
open_dataset(bucket$path("bucket_name/dataset-test"), format = "csv") %>% 
  collect() %>% 
  head()
#> # A tibble: 6 × 2
#>   number letter
#>    <int> <chr> 
#> 1      1 a     
#> 2      2 a     
#> 3      3 a     
#> 4      4 a     
#> 5      5 a     
#> 6      6 a


# shut down minio
minio_server$interrupt()
#> [1] TRUE
Sys.sleep(1)
stopifnot(!minio_server$is_alive())
{code}
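
If the main difference in your setup is the SubTreeFileSystem, note that 
open_dataset() also accepts a FileSystem object as its sources argument (that 
is what bucket$path() returns above), so you should be able to pass the 
SubTreeFileSystem itself instead of the list of files. Here is a minimal 
sketch that uses a local directory as a stand-in for the bucket so it runs 
without minio. The names root and sfs and the toy data are made up for 
illustration, and I haven't tried it with your schema or CSV options:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# hypothetical local stand-in for the bucket
root <- tempfile()
dir.create(root)

# write a small hive-partitioned CSV dataset (letter=a/, letter=b/)
data.frame(letter = rep(c("a", "b"), each = 3), number = 1:6) %>%
  group_by(letter) %>%
  write_dataset(file.path(root, "dataset-test"), format = "csv")

# a SubTreeFileSystem rooted at the dataset folder
# (assumption: this plays the role of your sfs object)
sfs <- SubTreeFileSystem$create(
  file.path(root, "dataset-test"),
  LocalFileSystem$create()
)

# pass the filesystem itself as the source, not a list of files;
# the letter=... sub folders are picked up as a partition column
ds <- open_dataset(sfs, format = "csv")
result <- ds %>% collect()
{code}

If that works locally but fails once sfs points at your minio bucket, that 
would help narrow down where the problem is.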


> [R] S3FileSystem - open_dataset
> -------------------------------
>
>                 Key: ARROW-15856
>                 URL: https://issues.apache.org/jira/browse/ARROW-15856
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Martin du Toit
>            Priority: Major
>
> Hi
>  I can successfully create a S3FileSystem that connects via minio. 
> I can create a SubTreeFileSystem: 
> s3://investmentaccountingdata/rawdata/transactions/transactions-xxx/v1.1/
> I can list the files in the SubTreeFileSystem, and I can open a dataset 
> from the list of files
> {code:R}
> list_files <- sfs$ls(recursive = TRUE)
> ds <- arrow::open_dataset(sources = list_files, schema = schema_file,
>                           format = csv_format, filesystem = sfs)
> {code}
> This all works fine, if I provide the list of files, but I want to specify a 
> path higher up to be able to include the sub folders as partitions. The code 
> I use works perfectly if I run it on a local disk.
> How can I do open_dataset, and give a folder as source?
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
