[ 
https://issues.apache.org/jira/browse/ARROW-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457280#comment-17457280
 ] 

Jonathan Keane commented on ARROW-15060:
----------------------------------------

Thanks for that, David. I was just going to point to the same ticket. I also 
tried locally:

{code}
> library(arrow, warn.conflicts = FALSE)
> 
> dir.create("./dataset")
> readr::write_csv(mtcars, "dataset/data-0.csv.gz")
> readr::write_csv(mtcars, "dataset/data-1.csv.gz")
> 
> ds <- open_dataset("dataset", format = "csv")
> dplyr::collect(ds)
# A tibble: 64 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 54 more rows
{code}

Though it's possible the S3 bit is the issue here?

> [R] open_dataset() on csv files lacks support for compressed files
> ------------------------------------------------------------------
>
>                 Key: ARROW-15060
>                 URL: https://issues.apache.org/jira/browse/ARROW-15060
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Using open_dataset() on S3 buckets of csv files is game-changing magic, 
> particularly with all the additional support for database / dplyr operations 
> over the remote connection, and the widespread adoption of S3 buckets even by 
> old-school big data providers like NOAA.
>  
> It's not uncommon to encounter buckets with *.csv.gz formats.  I know 
> technically this should be unnecessary, as compression can be done "in 
> flight" by the server, but usually this is not an issue for R users since R's 
> `connection` class automatically detects and gunzips compressed files (over 
> either POSIX or HTTP connections).  It would be really great if arrow could 
> handle this case too. 
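
For reference, the base R behaviour the description alludes to can be seen with a small local sketch. It uses only base R (not arrow) and a temporary file; the names `path`, `con` and `df` are illustrative and not taken from the ticket.

```r
# Write mtcars to a gzip-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, open = "wt")
write.csv(mtcars, con, row.names = FALSE)
close(con)

# read.csv() is given only the path; the file() connection it opens for
# reading detects the gzip header and decompresses transparently.
df <- read.csv(path)
dim(df)
```

No decompression step is requested explicitly on the read side, which is the convenience the reporter would like open_dataset() to match for remote *.csv.gz files.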



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
