[ https://issues.apache.org/jira/browse/ARROW-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sam Albers updated ARROW-17448:
-------------------------------
Description:

There are a few issues with the documentation for the cloud storage examples where paths are incorrect. For example in this vignette:

[https://arrow.apache.org/docs/r/articles/fs.html]

This doesn't work:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")){code}
rather it should be:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet")){code}
which I think makes sense as part-0 is the default writing convention for write_dataset and therefore something users are likely to see. Indeed this is the way the file structure was written:
{code:java}
library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets")
bucket$ls(path = "nyc-taxi/year=2011", recursive = TRUE)
#> [1] "nyc-taxi/year=2011/month=1"
#> [2] "nyc-taxi/year=2011/month=1/part-0.parquet"
#> [3] "nyc-taxi/year=2011/month=10"
#> [4] "nyc-taxi/year=2011/month=10/part-0.parquet"
#> [5] "nyc-taxi/year=2011/month=11"
#> [6] "nyc-taxi/year=2011/month=11/part-0.parquet"
#> [7] "nyc-taxi/year=2011/month=12"
#> [8] "nyc-taxi/year=2011/month=12/part-0.parquet"
#> [9] "nyc-taxi/year=2011/month=2"
#> [10] "nyc-taxi/year=2011/month=2/part-0.parquet"
#> [11] "nyc-taxi/year=2011/month=3"
#> [12] "nyc-taxi/year=2011/month=3/part-0.parquet"
#> [13] "nyc-taxi/year=2011/month=4"
#> [14] "nyc-taxi/year=2011/month=4/part-0.parquet"
#> [15] "nyc-taxi/year=2011/month=5"
#> [16] "nyc-taxi/year=2011/month=5/part-0.parquet"
#> [17] "nyc-taxi/year=2011/month=6"
#> [18] "nyc-taxi/year=2011/month=6/part-0.parquet"
#> [19] "nyc-taxi/year=2011/month=7"
#> [20] "nyc-taxi/year=2011/month=7/part-0.parquet"
#> [21] "nyc-taxi/year=2011/month=8"
#> [22] "nyc-taxi/year=2011/month=8/part-0.parquet"
#> [23] "nyc-taxi/year=2011/month=9"
#> [24] "nyc-taxi/year=2011/month=9/part-0.parquet"
{code}

was:

There are a few issues with the documentation for the cloud storage examples where paths are incorrect. For example in this vignette:

[https://arrow.apache.org/docs/r/articles/fs.html]

This doesn't work:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")){code}
rather it should be:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet")){code}
which I think makes sense as part-0 is the default writing convention for write_dataset and therefore something users are likely to see. Indeed this is the way the file structure was written:
{code:java}
library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets")
bucket$ls(path = "nyc-taxi/year=2011", recursive = TRUE)
#> [1] "nyc-taxi/year=2011/month=1"
#> [2] "nyc-taxi/year=2011/month=1/part-0.parquet"
#> [3] "nyc-taxi/year=2011/month=10"
#> [4] "nyc-taxi/year=2011/month=10/part-0.parquet"
#> [5] "nyc-taxi/year=2011/month=11"
#> [6] "nyc-taxi/year=2011/month=11/part-0.parquet"
#> [7] "nyc-taxi/year=2011/month=12"
#> [8] "nyc-taxi/year=2011/month=12/part-0.parquet"
#> [9] "nyc-taxi/year=2011/month=2"
#> [10] "nyc-taxi/year=2011/month=2/part-0.parquet"
#> [11] "nyc-taxi/year=2011/month=3"
#> [12] "nyc-taxi/year=2011/month=3/part-0.parquet"
#> [13] "nyc-taxi/year=2011/month=4"
#> [14] "nyc-taxi/year=2011/month=4/part-0.parquet"
#> [15] "nyc-taxi/year=2011/month=5"
#> [16] "nyc-taxi/year=2011/month=5/part-0.parquet"
#> [17] "nyc-taxi/year=2011/month=6"
#> [18] "nyc-taxi/year=2011/month=6/part-0.parquet"
#> [19] "nyc-taxi/year=2011/month=7"
#> [20] "nyc-taxi/year=2011/month=7/part-0.parquet"
#> [21] "nyc-taxi/year=2011/month=8"
#> [22] "nyc-taxi/year=2011/month=8/part-0.parquet"
#> [23] "nyc-taxi/year=2011/month=9"
#> [24] "nyc-taxi/year=2011/month=9/part-0.parquet"
{code}
I also see some examples that need updating in the cookbooks here:

[https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#read-a-parquet-file-from-s3]

and here:

[https://arrow.apache.org/cookbook/py/io.html#reading-partitioned-data-from-s3]


> [R][Python] Fix cloud storage paths in some documentation
> ---------------------------------------------------------
>
> Key: ARROW-17448
> URL: https://issues.apache.org/jira/browse/ARROW-17448
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, R
> Affects Versions: 9.0.0
> Reporter: Sam Albers
> Priority: Minor
>
> There are a few issues with the documentation for the cloud storage examples
> where paths are incorrect. For example in this vignette:
> [https://arrow.apache.org/docs/r/articles/fs.html]
> This doesn't work:
> {code:java}
> df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")){code}
> rather it should be:
> {code:java}
> df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet")){code}
> which I think makes sense as part-0 is the default writing convention for
> write_dataset and therefore something users are likely to see. Indeed this
> is the way the file structure was written:
> {code:java}
> library(arrow)
> bucket <- s3_bucket("voltrondata-labs-datasets")
> bucket$ls(path = "nyc-taxi/year=2011", recursive = TRUE)
> #> [1] "nyc-taxi/year=2011/month=1"
> #> [2] "nyc-taxi/year=2011/month=1/part-0.parquet"
> #> [3] "nyc-taxi/year=2011/month=10"
> #> [4] "nyc-taxi/year=2011/month=10/part-0.parquet"
> #> [5] "nyc-taxi/year=2011/month=11"
> #> [6] "nyc-taxi/year=2011/month=11/part-0.parquet"
> #> [7] "nyc-taxi/year=2011/month=12"
> #> [8] "nyc-taxi/year=2011/month=12/part-0.parquet"
> #> [9] "nyc-taxi/year=2011/month=2"
> #> [10] "nyc-taxi/year=2011/month=2/part-0.parquet"
> #> [11] "nyc-taxi/year=2011/month=3"
> #> [12] "nyc-taxi/year=2011/month=3/part-0.parquet"
> #> [13] "nyc-taxi/year=2011/month=4"
> #> [14] "nyc-taxi/year=2011/month=4/part-0.parquet"
> #> [15] "nyc-taxi/year=2011/month=5"
> #> [16] "nyc-taxi/year=2011/month=5/part-0.parquet"
> #> [17] "nyc-taxi/year=2011/month=6"
> #> [18] "nyc-taxi/year=2011/month=6/part-0.parquet"
> #> [19] "nyc-taxi/year=2011/month=7"
> #> [20] "nyc-taxi/year=2011/month=7/part-0.parquet"
> #> [21] "nyc-taxi/year=2011/month=8"
> #> [22] "nyc-taxi/year=2011/month=8/part-0.parquet"
> #> [23] "nyc-taxi/year=2011/month=9"
> #> [24] "nyc-taxi/year=2011/month=9/part-0.parquet"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
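
For the Python side of this issue, one way to avoid hard-coding a file name such as data.parquet is to derive it from a directory listing. The following is only a minimal sketch: `parquet_files_in` is a hypothetical helper (not part of pyarrow), and the listing is a handful of paths copied from the `bucket$ls()` output in the description. In practice the paths might come from pyarrow's `FileSystem.get_file_info` with a `FileSelector`, which is an assumption about how the helper would be fed and is not exercised here.

```python
from posixpath import dirname


def parquet_files_in(listing, partition_dir):
    """Return the Parquet files that sit directly inside partition_dir.

    `listing` is any iterable of '/'-separated object paths, such as the
    output of a recursive bucket listing. Hypothetical helper, for
    illustration only.
    """
    wanted = partition_dir.rstrip("/")
    return sorted(
        path
        for path in listing
        if path.endswith(".parquet") and dirname(path) == wanted
    )


# A few entries copied from the listing quoted in the issue description.
listing = [
    "nyc-taxi/year=2011/month=1",
    "nyc-taxi/year=2011/month=1/part-0.parquet",
    "nyc-taxi/year=2011/month=10",
    "nyc-taxi/year=2011/month=10/part-0.parquet",
]

# Comparing the parent directory exactly keeps month=1 from also
# matching month=10, which a plain prefix test would do.
print(parquet_files_in(listing, "nyc-taxi/year=2011/month=1"))
```

The returned paths can then be passed to the reader instead of a guessed file name, so the example keeps working if the writer's naming convention differs from what the documentation assumed.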