r2evans opened a new issue, #45373:
URL: https://github.com/apache/arrow/issues/45373
### Describe the bug, including details regarding any error messages, version, and platform.
I think there's a bug in when the sort requested by `arrange()` is applied.
If an `arrange(.)` in the lazy pipeline is followed by an aggregation with
`summarize`, collection still looks for the sorting column:
```r
library(arrow)
library(dplyr)
arrow_table(mtcars) |>
  summarize(across(mpg, list(Min = min, Max = max))) |>
  collect()
# # A tibble: 1 × 2
# mpg_Min mpg_Max
# <dbl> <dbl>
# 1 10.4 33.9
arrow_table(mtcars) |>
  arrange(mpg) |>
  summarize(across(mpg, list(Min = min, Max = max))) |>
  collect()
# Error in compute.arrow_dplyr_query(x) :
# Invalid: Invalid sort key column: No match for FieldRef.Name(mpg) in
# mpg_Min: double
# mpg_Max: double
# ----
# mpg_Min:
# [
# [
# 10.4
# ]
# ]
# mpg_Max:
# [
# [
# 33.9
# ]
# ]
```
This example is somewhat contrived _here_, in that this summarization does
not need ordered data. The underlying question remains: why is the data not
sorted _at that point_ and then summarized? I'm not certain whether the problem
is in the lazy sorting or in preserving the sort field(s) too aggressively.
This behavior is in contrast to a `select`ion removing the sorting column:
```r
arrow_table(mtcars) |>
  arrange(mpg) |>
  select(-mpg) |>
  collect()
# # A tibble: 32 × 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 8 472 205 2.93 5.25 18.0 0 0 3 4
# 2 8 460 215 3 5.42 17.8 0 0 3 4
# 3 8 350 245 3.73 3.84 15.4 0 0 3 4
# 4 8 360 245 3.21 3.57 15.8 0 0 3 4
# 5 8 440 230 3.23 5.34 17.4 0 0 3 4
# 6 8 301 335 3.54 3.57 14.6 0 1 5 8
# 7 8 276. 180 3.07 3.78 18 0 0 3 3
# 8 8 304 150 3.15 3.44 17.3 0 0 3 2
# 9 8 318 150 2.76 3.52 16.9 0 0 3 2
# 10 8 351 264 4.22 3.17 14.5 0 1 5 4
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
```
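A possible workaround for the `summarize` case is sketched below. It assumes (this is not verified against the arrow internals) that calling `compute()` right after `arrange()` materializes the sorted table, so the later `summarize()` starts a fresh query with no sort key left to resolve. The names `expected` and `workaround` are illustrative only.

```r
library(arrow)
library(dplyr)

# Reference result from plain dplyr, where arrange() before summarize()
# is harmless.
expected <- mtcars |>
  arrange(mpg) |>
  summarize(across(mpg, list(Min = min, Max = max)))

# Assumption: compute() evaluates the pending sort and returns a Table,
# so summarize() no longer sees `mpg` as a sort key.
workaround <- arrow_table(mtcars) |>
  arrange(mpg) |>
  compute() |>
  summarize(across(mpg, list(Min = min, Max = max))) |>
  collect()
```

If the assumption holds, `workaround` should contain the same `mpg_Min` and `mpg_Max` values as `expected`, at the cost of materializing the sorted table before aggregating.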
<details>
<summary> <code>> sessionInfo()</code> </summary>
```r
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2
Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_18.1.0.1 dplyr_1.1.4
loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 utf8_1.2.4 R6_2.5.1 bit_4.5.0.1 tidyselect_1.2.1 magrittr_2.0.3 glue_1.8.0 tibble_3.2.1 pkgconfig_2.0.3 bit64_4.5.2
[11] generics_0.1.3 lifecycle_1.0.4 cli_3.6.3 fansi_1.0.6 vctrs_0.6.5 withr_3.0.2 compiler_4.4.2 purrr_1.0.2 pillar_1.9.0 rlang_1.1.4
```
</details>
### Component(s)
R