[ 
https://issues.apache.org/jira/browse/ARROW-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268981#comment-17268981
 ] 

Jonathan Keane commented on ARROW-11328:
----------------------------------------

Thanks for the detailed report. I think I know what's going on here: when 
{{select()}} is called on a dataset with no columns, it gets translated into a 
projection of {{NULL}}, which arrow treats as selecting all columns. At a 
minimum, arrow should raise an error when no columns have been selected, but it 
would be nicer to behave consistently with RecordBatches and other dplyr 
backends. If you're up for it, we would welcome a PR.
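
To illustrate the distinction described above, here is a minimal base R sketch. 
It is not arrow's actual code path: {{project_columns()}} is a hypothetical 
helper that only shows why a {{NULL}} projection ("no projection requested") 
should be kept separate from a zero-length selection ("project onto no 
columns").
{code:r}
# Hypothetical helper for illustration only; it is not part of the arrow package.
project_columns <- function(df, cols = NULL) {
  if (is.null(cols)) {
    # NULL means "no projection requested": keep every column.
    return(df)
  }
  # A zero-length selection is a real projection onto no columns.
  # drop = FALSE keeps the result a data frame and preserves the rows.
  df[, cols, drop = FALSE]
}

all_cols <- project_columns(mtcars)                # NULL projection
no_cols  <- project_columns(mtcars, character(0))  # empty selection

ncol(all_cols)
ncol(no_cols)
nrow(no_cols)
{code}
If an empty selection is converted to {{NULL}} somewhere along the way, the two 
cases become indistinguishable, which would be consistent with the behavior in 
the report.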

> [R] Collecting zero columns from a dataset returns entire dataset
> -----------------------------------------------------------------
>
>                 Key: ARROW-11328
>                 URL: https://issues.apache.org/jira/browse/ARROW-11328
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: András Svraka
>            Assignee: Jonathan Keane
>            Priority: Major
>
> Collecting a dataset with zero selected columns returns all columns of the 
> dataset in a data frame without column names.
> {code:r}
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> tmp <- tempfile()
> write_dataset(mtcars, tmp, format = "parquet")
> open_dataset(tmp) %>% select() %>% collect()
> #>                                             
> #> 1  21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
> #> 2  21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
> #> 3  22.8 4 108.0  93 3.85 2.320 18.61 1 1 4 1
> #> 4  21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
> #> 5  18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
> #> 6  18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
> #> 7  14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
> #> 8  24.4 4 146.7  62 3.69 3.190 20.00 1 0 4 2
> #> 9  22.8 4 140.8  95 3.92 3.150 22.90 1 0 4 2
> #> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
> #> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
> #> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
> #> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
> #> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
> #> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
> #> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
> #> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
> #> 18 32.4 4  78.7  66 4.08 2.200 19.47 1 1 4 1
> #> 19 30.4 4  75.7  52 4.93 1.615 18.52 1 1 4 2
> #> 20 33.9 4  71.1  65 4.22 1.835 19.90 1 1 4 1
> #> 21 21.5 4 120.1  97 3.70 2.465 20.01 1 0 3 1
> #> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
> #> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
> #> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
> #> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
> #> 26 27.3 4  79.0  66 4.08 1.935 18.90 1 1 4 1
> #> 27 26.0 4 120.3  91 4.43 2.140 16.70 0 1 5 2
> #> 28 30.4 4  95.1 113 3.77 1.513 16.90 1 1 5 2
> #> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
> #> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
> #> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
> #> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> {code}
> Empty selections in dplyr return data frames with zero columns and, based on 
> the test cases covering 
> [dplyr verbs|https://github.com/apache/arrow/blob/dfee3917dc011e184264187f505da1de3d1d6fbb/r/tests/testthat/test-dplyr.R#L413-L425]
> on RecordBatches, arrow already handles empty selections there in the same way.
> Created on 2021-01-20 by the [reprex package|https://reprex.tidyverse.org] 
> (v0.3.0)
> Session info
> {code:r}
> devtools::session_info()
> #> ─ Session info ───────────────────────────────────────────────────────────────
> #>  setting  value                       
> #>  version  R version 4.0.3 (2020-10-10)
> #>  os       Ubuntu 20.04.1 LTS          
> #>  system   x86_64, linux-gnu           
> #>  ui       X11                         
> #>  language (EN)                        
> #>  collate  en_US.UTF-8                 
> #>  ctype    en_US.UTF-8                 
> #>  tz       Etc/UTC                     
> #>  date     2021-01-20                  
> #> 
> #> - Packages -------------------------------------------------------------------
> #>  package     * version        date       lib source        
> #>  arrow       * 2.0.0.20210119 2021-01-20 [1] local         
> #>  assertthat    0.2.1          2019-03-21 [1] RSPM (R 4.0.0)
> #>  bit           4.0.4          2020-08-04 [1] RSPM (R 4.0.2)
> #>  bit64         4.0.5          2020-08-30 [1] RSPM (R 4.0.2)
> #>  callr         3.5.1          2020-10-13 [1] RSPM (R 4.0.2)
> #>  cli           2.2.0          2020-11-20 [1] CRAN (R 4.0.3)
> #>  crayon        1.3.4          2017-09-16 [1] RSPM (R 4.0.0)
> #>  DBI           1.1.1          2021-01-15 [1] CRAN (R 4.0.3)
> #>  desc          1.2.0          2018-05-01 [1] RSPM (R 4.0.0)
> #>  devtools      2.3.2          2020-09-18 [1] RSPM (R 4.0.2)
> #>  digest        0.6.27         2020-10-24 [1] RSPM (R 4.0.3)
> #>  dplyr       * 1.0.3          2021-01-15 [1] CRAN (R 4.0.3)
> #>  ellipsis      0.3.1          2020-05-15 [1] RSPM (R 4.0.0)
> #>  evaluate      0.14           2019-05-28 [1] RSPM (R 4.0.0)
> #>  fansi         0.4.2          2021-01-15 [1] CRAN (R 4.0.3)
> #>  fs            1.5.0          2020-07-31 [1] RSPM (R 4.0.2)
> #>  generics      0.1.0          2020-10-31 [1] CRAN (R 4.0.3)
> #>  glue          1.4.2          2020-08-27 [1] RSPM (R 4.0.2)
> #>  highr         0.8            2019-03-20 [1] RSPM (R 4.0.0)
> #>  htmltools     0.5.1          2021-01-12 [1] RSPM (R 4.0.3)
> #>  knitr         1.30           2020-09-22 [1] CRAN (R 4.0.2)
> #>  lifecycle     0.2.0          2020-03-06 [1] RSPM (R 4.0.0)
> #>  magrittr      2.0.1          2020-11-17 [1] RSPM (R 4.0.3)
> #>  memoise       1.1.0          2017-04-21 [1] RSPM (R 4.0.0)
> #>  pillar        1.4.7          2020-11-20 [1] CRAN (R 4.0.3)
> #>  pkgbuild      1.2.0          2020-12-15 [1] RSPM (R 4.0.3)
> #>  pkgconfig     2.0.3          2019-09-22 [1] RSPM (R 4.0.0)
> #>  pkgload       1.1.0          2020-05-29 [1] RSPM (R 4.0.0)
> #>  prettyunits   1.1.1          2020-01-24 [1] RSPM (R 4.0.0)
> #>  processx      3.4.5          2020-11-30 [1] RSPM (R 4.0.3)
> #>  ps            1.5.0          2020-12-05 [1] CRAN (R 4.0.3)
> #>  purrr         0.3.4          2020-04-17 [1] RSPM (R 4.0.0)
> #>  R6            2.5.0          2020-10-28 [1] RSPM (R 4.0.3)
> #>  remotes       2.2.0          2020-07-21 [1] RSPM (R 4.0.2)
> #>  rlang         0.4.10         2020-12-30 [1] CRAN (R 4.0.3)
> #>  rmarkdown     2.6            2020-12-14 [1] RSPM (R 4.0.3)
> #>  rprojroot     2.0.2          2020-11-15 [1] RSPM (R 4.0.3)
> #>  sessioninfo   1.1.1          2018-11-05 [1] RSPM (R 4.0.0)
> #>  stringi       1.5.3          2020-09-09 [1] RSPM (R 4.0.2)
> #>  stringr       1.4.0          2019-02-10 [1] RSPM (R 4.0.0)
> #>  testthat      3.0.1          2020-12-17 [1] RSPM (R 4.0.3)
> #>  tibble        3.0.5          2021-01-15 [1] CRAN (R 4.0.3)
> #>  tidyselect    1.1.0          2020-05-11 [1] RSPM (R 4.0.0)
> #>  usethis       2.0.0          2020-12-10 [1] RSPM (R 4.0.3)
> #>  vctrs         0.3.6          2020-12-17 [1] RSPM (R 4.0.3)
> #>  withr         2.4.0          2021-01-16 [1] CRAN (R 4.0.3)
> #>  xfun          0.20           2021-01-06 [1] RSPM (R 4.0.3)
> #>  yaml          2.2.1          2020-02-01 [1] RSPM (R 4.0.0)
> #> 
> #> [1] /usr/local/lib/R/site-library
> #> [2] /usr/lib/R/site-library
> #> [3] /usr/lib/R/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
