Brandon Bertelsen created ARROW-14856:
-----------------------------------------

             Summary: [R] group by n() on partitioning variables counts files not rows
                 Key: ARROW-14856
                 URL: https://issues.apache.org/jira/browse/ARROW-14856
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Brandon Bertelsen


It appears that when grouping by a partitioning variable, summarize()/tally() with n() now counts the number of files in each group rather than the number of rows.

Using the R package from CRAN, version 6.0.0.2.

{code:java}
library(arrow)
library(dplyr)

set.seed(42)
df <- data.frame(a = sample(1:1e6))
df$letters <- sample(letters, 1e6, replace = TRUE)

# Write one directory per letter (non-hive style: test/a, test/b, ...)
write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)

# A single partition file holds many rows
r <- read_parquet("test/a/part-0.parquet")
nrow(r)
# 38389

# Counting by the partitioning variable returns 1 per group (one file each)
ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()
# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 c           1
#  2 p           1
#  3 a           1
#  4 b           1
#  5 e           1
#  6 q           1
#  7 d           1
#  8 g           1
#  9 r           1
# 10 h           1
# # … with 16 more rows

# Add a second file to partition "a"; its count becomes 2
file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")
ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>% arrange(-n)
# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 a           2
#  2 d           1
#  3 f           1
#  4 x           1
#  5 c           1
#  6 b           1
#  7 e           1
#  8 g           1
#  9 u           1
# 10 k           1
# # … with 16 more rows

# What about with summarize n = n()?
ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>% collect() %>% arrange(-n)
# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 a           2
#  2 b           1
#  3 h           1
#  4 g           1
#  5 c           1
#  6 i           1
#  7 j           1
#  8 s           1
#  9 k           1
# 10 d           1
# # … with 16 more rows
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)