PMassicotte opened a new issue, #40426: URL: https://github.com/apache/arrow/issues/40426
### Describe the bug, including details regarding any error messages, version, and platform.

I am trying to use `open_dataset()` and it works fine if I use `anonymous = TRUE`.

``` r
library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#>
#>     duration
#> The following object is masked from 'package:utils':
#>
#>     timestamp

bb <- s3_bucket(
  bucket = "cdoc",
  endpoint_override = "s3.valeria.science",
  anonymous = TRUE
)

ds <- open_dataset(bb)

ds
#> FileSystemDataset with 130 Parquet files
#> river: string
#> date: date32[day]
#> temperature: double
#> doc: double
#> wavelength: double
#> absorption: double
#> longitude: double
#> latitude: double
#> unique_id: string
#> study_id: string
#> ecosystem: string
#> t: double
#> sample: string
#> Season: string
#> depth: double
#> salinity: double
#> secchi: double
#> file_name: string
#> location_id: string
#> station_name: string
#> time: int32
#> do: double
#> bag: double
#> type: string
#> treatment: string
#> cruise: string
#> station: string
#> bottomdepth: double
#> niskin: int32
#> pressure: int32
#> id: string
#> ph: double
#> stream: string
#> site: string
#> month: double
#> DOC_FLAG: double
#> wbic: int32
#> record_id: int32
#> groupid: int32
#> cuvette: int32
#> elevation: double
#> elevation_ned: double
#> depth_ned: double
#> perimeter: double
#> depth_position: string

ds |>
  summarise(mean_doc = mean(doc, na.rm = TRUE), .by = ecosystem, n = n()) |>
  collect()
#> # A tibble: 5 × 3
#>   ecosystem mean_doc       n
#>   <chr>        <dbl>   <int>
#> 1 coastal      253.   581164
#> 2 ocean         60.2 1391192
#> 3 river        527.   904279
#> 4 lake        1323.   140691
#> 5 estuary      235.   296026
```

However, if I do not set `anonymous = TRUE` the execution stalls and I have no feedback on what is going on.

``` r
bb <- s3_bucket(
  bucket = "cdoc",
  endpoint_override = "s3.valeria.science"
)

ds <- open_dataset(bb)

ds
#> FileSystemDataset with 130 Parquet files
#> river: string
#> date: date32[day]
#> temperature: double
#> doc: double
#> wavelength: double
#> absorption: double
#> longitude: double
#> latitude: double
#> unique_id: string
#> study_id: string
#> ecosystem: string
#> t: double
#> sample: string
#> Season: string
#> depth: double
#> salinity: double
#> secchi: double
#> file_name: string
#> location_id: string
#> station_name: string
#> time: int32
#> do: double
#> bag: double
#> type: string
#> treatment: string
#> cruise: string
#> station: string
#> bottomdepth: double
#> niskin: int32
#> pressure: int32
#> id: string
#> ph: double
#> stream: string
#> site: string
#> month: double
#> DOC_FLAG: double
#> wbic: int32
#> record_id: int32
#> groupid: int32
#> cuvette: int32
#> elevation: double
#> elevation_ned: double
#> depth_ned: double
#> perimeter: double
#> depth_position: string
```

What is strange to me is that we have something in `ds`. However, I can not do any operations. The code below will stall.

``` r
# ds |>
#   summarise(mean_doc = mean(doc, na.rm = TRUE), .by = ecosystem, n = n()) |>
#   collect()
```

<sup>Created on 2024-03-08 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>

<details style="margin-bottom:10px;">
<summary>
Session info
</summary>

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       Linux Mint 21.3
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_CA:en
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Montreal
#>  date     2024-03-08
#>  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.2.1 2024-02-23 [1] RSPM (R 4.3.0)
#>  assertthat    0.2.1    2019-03-21 [1] RSPM (R 4.3.0)
#>  bit           4.0.5    2022-11-15 [1] RSPM (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.3.0)
#>  cli           3.6.2    2023-12-11 [1] RSPM (R 4.3.0)
#>  colorspace    2.1-0    2023-01-23 [1] RSPM (R 4.3.0)
#>  digest        0.6.34   2024-01-11 [1] RSPM (R 4.3.0)
#>  dplyr       * 1.1.4    2023-11-17 [1] RSPM (R 4.3.0)
#>  evaluate      0.23     2023-11-01 [1] RSPM (R 4.3.0)
#>  fansi         1.0.6    2023-12-08 [1] RSPM (R 4.3.0)
#>  fastmap       1.1.1    2023-02-24 [1] RSPM (R 4.3.0)
#>  forcats     * 1.0.0    2023-01-29 [1] RSPM (R 4.3.0)
#>  fs            1.6.3    2023-07-20 [1] RSPM (R 4.3.0)
#>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.3.0)
#>  ggplot2     * 3.5.0    2024-02-23 [1] CRAN (R 4.3.2)
#>  glue          1.7.0    2024-01-09 [1] RSPM (R 4.3.0)
#>  gtable        0.3.4    2023-08-21 [1] RSPM (R 4.3.0)
#>  hms           1.1.3    2023-03-21 [1] RSPM (R 4.3.0)
#>  htmltools     0.5.7    2023-11-03 [1] RSPM (R 4.3.0)
#>  knitr         1.45     2023-10-30 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.4    2023-11-07 [1] RSPM (R 4.3.0)
#>  lubridate   * 1.9.3    2023-09-27 [1] RSPM (R 4.3.0)
#>  magrittr      2.0.3    2022-03-30 [1] RSPM (R 4.3.0)
#>  munsell       0.5.0    2018-06-12 [1] RSPM (R 4.3.0)
#>  pillar        1.9.0    2023-03-22 [1] RSPM (R 4.3.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] RSPM (R 4.3.0)
#>  purrr       * 1.0.2    2023-08-10 [1] RSPM (R 4.3.0)
#>  R.cache       0.16.0   2022-07-21 [1] RSPM (R 4.3.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] RSPM (R 4.3.0)
#>  R.oo          1.26.0   2024-01-24 [1] CRAN (R 4.3.2)
#>  R.utils       2.12.3   2023-11-18 [1] RSPM (R 4.3.0)
#>  R6            2.5.1    2021-08-19 [1] RSPM (R 4.3.0)
#>  readr       * 2.1.5    2024-01-10 [1] RSPM (R 4.3.0)
#>  reprex        2.1.0    2024-01-11 [1] RSPM (R 4.3.0)
#>  rlang         1.1.3    2024-01-10 [1] RSPM (R 4.3.0)
#>  rmarkdown     2.26     2024-03-05 [1] CRAN (R 4.3.3)
#>  rstudioapi    0.15.0   2023-07-07 [1] RSPM (R 4.3.0)
#>  scales        1.3.0    2023-11-28 [1] RSPM (R 4.3.0)
#>  sessioninfo   1.2.2    2021-12-06 [1] RSPM (R 4.3.0)
#>  stringi       1.8.3    2023-12-11 [1] RSPM (R 4.3.0)
#>  stringr     * 1.5.1    2023-11-14 [1] CRAN (R 4.3.2)
#>  styler        1.10.2   2023-08-29 [1] RSPM (R 4.3.0)
#>  tibble      * 3.2.1    2023-03-20 [1] RSPM (R 4.3.0)
#>  tidyr       * 1.3.1    2024-01-24 [1] CRAN (R 4.3.2)
#>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.3.0)
#>  tidyverse   * 2.0.0    2023-02-22 [1] RSPM (R 4.3.0)
#>  timechange    0.3.0    2024-01-18 [1] CRAN (R 4.3.2)
#>  tzdb          0.4.0    2023-05-12 [1] RSPM (R 4.3.0)
#>  utf8          1.2.4    2023-10-22 [1] RSPM (R 4.3.0)
#>  vctrs         0.6.5    2023-12-01 [1] RSPM (R 4.3.0)
#>  withr         3.0.0    2024-01-16 [1] CRAN (R 4.3.2)
#>  xfun          0.42     2024-02-08 [1] RSPM (R 4.3.0)
#>  yaml          2.3.8    2023-12-11 [1] RSPM (R 4.3.0)
#>
#>  [1] /home/filoche/R/x86_64-pc-linux-gnu-library/4.3
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

</details>

### Component(s)

R