[ https://issues.apache.org/jira/browse/ARROW-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457280#comment-17457280 ]
Jonathan Keane commented on ARROW-15060:
----------------------------------------

Thanks for that David, I was just going to point to the same ticket. I also tried locally:

{code}
> library(arrow, warn.conflicts = FALSE)
>
> dir.create("./dataset")
> readr::write_csv(mtcars, "dataset/data-0.csv.gz")
> readr::write_csv(mtcars, "dataset/data-1.csv.gz")
>
> ds <- open_dataset("dataset", format = "csv")
> dplyr::collect(ds)
# A tibble: 64 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 54 more rows
{code}

Though it's possible the S3 bit is the issue here?

> [R] open_dataset() on csv files lacks support for compressed files
> ------------------------------------------------------------------
>
>                 Key: ARROW-15060
>                 URL: https://issues.apache.org/jira/browse/ARROW-15060
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Using open_dataset() on S3 buckets of csv files is a game-changing magic,
> particularly with all the additional support for database / dplyr operations
> over the remote connection, and the widespread adoption of S3 buckets even by
> old-school big data providers like NOAA.
>
> It's not uncommon to encounter buckets with *.csv.gz formats. I know
> technically this should be unnecessary, as compression can be done "in
> flight" by the server, but usually this is not an issue for R users since R's
> `connection` class automatically detects and gunzips compressed files (over
> either POSIX or HTTP connections). It would be really great if arrow could
> handle this case too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)