PMassicotte opened a new issue, #40426:
URL: https://github.com/apache/arrow/issues/40426

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I am trying to use `open_dataset()` and it works fine if I use `anonymous = TRUE`.
   
   ``` r
   library(tidyverse)
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:lubridate':
   #> 
   #>     duration
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   bb <- s3_bucket(
     bucket = "cdoc",
     endpoint_override = "s3.valeria.science",
     anonymous = TRUE
   )
   
   ds <- open_dataset(bb)
   
   ds
   #> FileSystemDataset with 130 Parquet files
   #> river: string
   #> date: date32[day]
   #> temperature: double
   #> doc: double
   #> wavelength: double
   #> absorption: double
   #> longitude: double
   #> latitude: double
   #> unique_id: string
   #> study_id: string
   #> ecosystem: string
   #> t: double
   #> sample: string
   #> Season: string
   #> depth: double
   #> salinity: double
   #> secchi: double
   #> file_name: string
   #> location_id: string
   #> station_name: string
   #> time: int32
   #> do: double
   #> bag: double
   #> type: string
   #> treatment: string
   #> cruise: string
   #> station: string
   #> bottomdepth: double
   #> niskin: int32
   #> pressure: int32
   #> id: string
   #> ph: double
   #> stream: string
   #> site: string
   #> month: double
   #> DOC_FLAG: double
   #> wbic: int32
   #> record_id: int32
   #> groupid: int32
   #> cuvette: int32
   #> elevation: double
   #> elevation_ned: double
   #> depth_ned: double
   #> perimeter: double
   #> depth_position: string
   
   ds |>
     summarise(mean_doc = mean(doc, na.rm = TRUE), .by = ecosystem, n = n()) |>
     collect()
   #> # A tibble: 5 × 3
   #>   ecosystem mean_doc       n
   #>   <chr>        <dbl>   <int>
   #> 1 coastal      253.   581164
   #> 2 ocean         60.2 1391192
   #> 3 river        527.   904279
   #> 4 lake        1323.   140691
   #> 5 estuary      235.   296026
   ```
   
   However, if I do not set `anonymous = TRUE`, execution stalls and I get no feedback on what is going on.
   
   ``` r
   bb <- s3_bucket(
     bucket = "cdoc",
     endpoint_override = "s3.valeria.science"
   )
   
   ds <- open_dataset(bb)
   
   ds
   #> FileSystemDataset with 130 Parquet files
   #> river: string
   #> date: date32[day]
   #> temperature: double
   #> doc: double
   #> wavelength: double
   #> absorption: double
   #> longitude: double
   #> latitude: double
   #> unique_id: string
   #> study_id: string
   #> ecosystem: string
   #> t: double
   #> sample: string
   #> Season: string
   #> depth: double
   #> salinity: double
   #> secchi: double
   #> file_name: string
   #> location_id: string
   #> station_name: string
   #> time: int32
   #> do: double
   #> bag: double
   #> type: string
   #> treatment: string
   #> cruise: string
   #> station: string
   #> bottomdepth: double
   #> niskin: int32
   #> pressure: int32
   #> id: string
   #> ph: double
   #> stream: string
   #> site: string
   #> month: double
   #> DOC_FLAG: double
   #> wbic: int32
   #> record_id: int32
   #> groupid: int32
   #> cuvette: int32
   #> elevation: double
   #> elevation_ned: double
   #> depth_ned: double
   #> perimeter: double
   #> depth_position: string
   ```
   
   What is strange to me is that `ds` is populated: the schema prints as before. However, I cannot perform any operations on it; the code below stalls.
   
   ``` r
   # ds |>
   #  summarise(mean_doc = mean(doc, na.rm = TRUE), .by = ecosystem, n = n()) |>
   #  collect()
   ```
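   
   Since the stall gives no feedback, one possible way to see what the S3 client is doing is to raise the S3 log level before any S3 call is made. This is a minimal sketch, assuming the `ARROW_S3_LOG_LEVEL` environment variable described in the arrow R cloud storage documentation; `open_cdoc()` is a hypothetical helper introduced only for illustration and is not part of arrow. What the log would show for this bucket is not known.
   
   ``` r
   # Assumption: ARROW_S3_LOG_LEVEL must be set before the first S3 interaction,
   # so it is set here before arrow is even loaded.
   Sys.setenv(ARROW_S3_LOG_LEVEL = "DEBUG")
   
   library(arrow)
   
   # Hypothetical helper (not part of arrow): opens the same bucket as above,
   # with or without anonymous access, so both cases can be compared.
   open_cdoc <- function(anonymous = FALSE) {
     bb <- s3_bucket(
       bucket = "cdoc",
       endpoint_override = "s3.valeria.science",
       anonymous = anonymous
     )
     open_dataset(bb)
   }
   ```
   
   Calling `open_cdoc(anonymous = FALSE)` followed by a `collect()` should then write the S3 client's log messages to the console while the query hangs, which may help narrow down where it is waiting.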
   
   <sup>Created on 2024-03-08 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
   
   <details style="margin-bottom:10px;">
   <summary>
   Session info
   </summary>
   
   ``` r
   sessioninfo::session_info()
   #> ─ Session info ───────────────────────────────────────────────────────────────
   #>  setting  value
   #>  version  R version 4.3.3 (2024-02-29)
   #>  os       Linux Mint 21.3
   #>  system   x86_64, linux-gnu
   #>  ui       X11
   #>  language en_CA:en
   #>  collate  en_CA.UTF-8
   #>  ctype    en_CA.UTF-8
   #>  tz       America/Montreal
   #>  date     2024-03-08
   #>  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
   #> 
   #> ─ Packages ───────────────────────────────────────────────────────────────────
   #>  package     * version  date (UTC) lib source
   #>  arrow       * 14.0.2.1 2024-02-23 [1] RSPM (R 4.3.0)
   #>  assertthat    0.2.1    2019-03-21 [1] RSPM (R 4.3.0)
   #>  bit           4.0.5    2022-11-15 [1] RSPM (R 4.3.0)
   #>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.3.0)
   #>  cli           3.6.2    2023-12-11 [1] RSPM (R 4.3.0)
   #>  colorspace    2.1-0    2023-01-23 [1] RSPM (R 4.3.0)
   #>  digest        0.6.34   2024-01-11 [1] RSPM (R 4.3.0)
   #>  dplyr       * 1.1.4    2023-11-17 [1] RSPM (R 4.3.0)
   #>  evaluate      0.23     2023-11-01 [1] RSPM (R 4.3.0)
   #>  fansi         1.0.6    2023-12-08 [1] RSPM (R 4.3.0)
   #>  fastmap       1.1.1    2023-02-24 [1] RSPM (R 4.3.0)
   #>  forcats     * 1.0.0    2023-01-29 [1] RSPM (R 4.3.0)
   #>  fs            1.6.3    2023-07-20 [1] RSPM (R 4.3.0)
   #>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.3.0)
   #>  ggplot2     * 3.5.0    2024-02-23 [1] CRAN (R 4.3.2)
   #>  glue          1.7.0    2024-01-09 [1] RSPM (R 4.3.0)
   #>  gtable        0.3.4    2023-08-21 [1] RSPM (R 4.3.0)
   #>  hms           1.1.3    2023-03-21 [1] RSPM (R 4.3.0)
   #>  htmltools     0.5.7    2023-11-03 [1] RSPM (R 4.3.0)
   #>  knitr         1.45     2023-10-30 [1] RSPM (R 4.3.0)
   #>  lifecycle     1.0.4    2023-11-07 [1] RSPM (R 4.3.0)
   #>  lubridate   * 1.9.3    2023-09-27 [1] RSPM (R 4.3.0)
   #>  magrittr      2.0.3    2022-03-30 [1] RSPM (R 4.3.0)
   #>  munsell       0.5.0    2018-06-12 [1] RSPM (R 4.3.0)
   #>  pillar        1.9.0    2023-03-22 [1] RSPM (R 4.3.0)
   #>  pkgconfig     2.0.3    2019-09-22 [1] RSPM (R 4.3.0)
   #>  purrr       * 1.0.2    2023-08-10 [1] RSPM (R 4.3.0)
   #>  R.cache       0.16.0   2022-07-21 [1] RSPM (R 4.3.0)
   #>  R.methodsS3   1.8.2    2022-06-13 [1] RSPM (R 4.3.0)
   #>  R.oo          1.26.0   2024-01-24 [1] CRAN (R 4.3.2)
   #>  R.utils       2.12.3   2023-11-18 [1] RSPM (R 4.3.0)
   #>  R6            2.5.1    2021-08-19 [1] RSPM (R 4.3.0)
   #>  readr       * 2.1.5    2024-01-10 [1] RSPM (R 4.3.0)
   #>  reprex        2.1.0    2024-01-11 [1] RSPM (R 4.3.0)
   #>  rlang         1.1.3    2024-01-10 [1] RSPM (R 4.3.0)
   #>  rmarkdown     2.26     2024-03-05 [1] CRAN (R 4.3.3)
   #>  rstudioapi    0.15.0   2023-07-07 [1] RSPM (R 4.3.0)
   #>  scales        1.3.0    2023-11-28 [1] RSPM (R 4.3.0)
   #>  sessioninfo   1.2.2    2021-12-06 [1] RSPM (R 4.3.0)
   #>  stringi       1.8.3    2023-12-11 [1] RSPM (R 4.3.0)
   #>  stringr     * 1.5.1    2023-11-14 [1] CRAN (R 4.3.2)
   #>  styler        1.10.2   2023-08-29 [1] RSPM (R 4.3.0)
   #>  tibble      * 3.2.1    2023-03-20 [1] RSPM (R 4.3.0)
   #>  tidyr       * 1.3.1    2024-01-24 [1] CRAN (R 4.3.2)
   #>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.3.0)
   #>  tidyverse   * 2.0.0    2023-02-22 [1] RSPM (R 4.3.0)
   #>  timechange    0.3.0    2024-01-18 [1] CRAN (R 4.3.2)
   #>  tzdb          0.4.0    2023-05-12 [1] RSPM (R 4.3.0)
   #>  utf8          1.2.4    2023-10-22 [1] RSPM (R 4.3.0)
   #>  vctrs         0.6.5    2023-12-01 [1] RSPM (R 4.3.0)
   #>  withr         3.0.0    2024-01-16 [1] CRAN (R 4.3.2)
   #>  xfun          0.42     2024-02-08 [1] RSPM (R 4.3.0)
   #>  yaml          2.3.8    2023-12-11 [1] RSPM (R 4.3.0)
   #> 
   #>  [1] /home/filoche/R/x86_64-pc-linux-gnu-library/4.3
   #>  [2] /usr/local/lib/R/site-library
   #>  [3] /usr/lib/R/site-library
   #>  [4] /usr/lib/R/library
   #> 
   #> ──────────────────────────────────────────────────────────────────────────────
   ```
   
   </details>
   
   ### Component(s)
   
   R

